{"id":7514,"date":"2026-03-25T09:08:03","date_gmt":"2026-03-25T09:08:03","guid":{"rendered":"https:\/\/lite16.com\/blog\/?p=7514"},"modified":"2026-03-25T09:08:03","modified_gmt":"2026-03-25T09:08:03","slug":"reinforcement-learning","status":"publish","type":"post","link":"https:\/\/lite16.com\/blog\/2026\/03\/25\/reinforcement-learning\/","title":{"rendered":"Reinforcement Learning"},"content":{"rendered":"<h2 data-start=\"84\" data-end=\"125\">Introduction<\/h2>\n<p data-start=\"127\" data-end=\"644\">Reinforcement Learning (RL) is a dynamic area of machine learning that focuses on how agents should take actions in an environment to maximize cumulative rewards. Unlike supervised learning, where models are trained on labeled datasets, RL involves learning from interaction, trial-and-error, and delayed feedback. It draws inspiration from behavioral psychology, where learning is driven by rewards and punishments, making it particularly suitable for tasks where explicit instruction or labeled data is unavailable.<\/p>\n<h3 data-start=\"646\" data-end=\"670\">Fundamental Concepts<\/h3>\n<p data-start=\"672\" data-end=\"1373\">At the heart of reinforcement learning are <strong data-start=\"715\" data-end=\"725\">agents<\/strong>, <strong data-start=\"727\" data-end=\"743\">environments<\/strong>, <strong data-start=\"745\" data-end=\"755\">states<\/strong>, <strong data-start=\"757\" data-end=\"768\">actions<\/strong>, and <strong data-start=\"774\" data-end=\"785\">rewards<\/strong>. An <strong data-start=\"790\" data-end=\"799\">agent<\/strong> is the learner or decision-maker, while the <strong data-start=\"844\" data-end=\"859\">environment<\/strong> is everything external with which the agent interacts. The <strong data-start=\"919\" data-end=\"928\">state<\/strong> is a representation of the current situation in the environment, capturing all necessary information for decision-making. <strong data-start=\"1051\" data-end=\"1062\">Actions<\/strong> are choices the agent can make to influence the environment, and <strong data-start=\"1128\" data-end=\"1139\">rewards<\/strong> are signals that evaluate the effectiveness of these actions. The ultimate goal of an RL agent is to learn a <strong data-start=\"1249\" data-end=\"1259\">policy<\/strong>, a strategy that maps states to actions to maximize the long-term cumulative reward, often called the <strong data-start=\"1362\" data-end=\"1372\">return<\/strong>.<\/p>\n<p data-start=\"1375\" data-end=\"1964\">Mathematically, RL problems are commonly formulated as a <strong data-start=\"1432\" data-end=\"1465\">Markov Decision Process (MDP)<\/strong>. 
An MDP is defined by a tuple <span class=\"katex\"><span class=\"katex-mathml\">(S,A,P,R,\u03b3)(S, A, P, R, \\gamma)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">S<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">A<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">P<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">R<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>, where <span class=\"katex\"><span class=\"katex-mathml\">SS<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">S<\/span><\/span><\/span><\/span> is a set of states, <span class=\"katex\"><span class=\"katex-mathml\">AA<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">A<\/span><\/span><\/span><\/span> is a set of actions, <span class=\"katex\"><span class=\"katex-mathml\">PP<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><\/span><\/span><\/span> is the state transition probability function, <span class=\"katex\"><span class=\"katex-mathml\">RR<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">R<\/span><\/span><\/span><\/span> is the reward function, and <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span> is the discount factor. The discount factor balances the importance of immediate versus future rewards, typically taking a value between 0 and 1. A higher <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span> values long-term rewards, encouraging the agent to plan ahead, while a lower <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span> focuses more on immediate gains.<\/p>\n<h3 data-start=\"1966\" data-end=\"1998\">Exploration vs. Exploitation<\/h3>\n<p data-start=\"2000\" data-end=\"2534\">One of the central challenges in RL is the <strong data-start=\"2043\" data-end=\"2079\">exploration-exploitation dilemma<\/strong>. An agent must balance <strong data-start=\"2103\" data-end=\"2117\">exploiting<\/strong> known strategies that yield high rewards with <strong data-start=\"2164\" data-end=\"2177\">exploring<\/strong> new actions that might result in even higher long-term gains. Too much exploitation may cause the agent to settle prematurely on suboptimal strategies, while excessive exploration can slow learning and reduce efficiency. 
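For instance, a simple epsilon-greedy rule explores with a small probability and otherwise exploits the action with the highest current value estimate; the short Python sketch below illustrates the idea (the Q-value array and the four-action setup are invented for the example).<\/p>
<pre><code>import numpy as np\n\ndef epsilon_greedy(q_values, epsilon=0.1):\n    # With probability epsilon, explore by picking a random action;\n    # otherwise exploit the action with the highest estimated value.\n    if np.random.rand() &lt; epsilon:\n        return int(np.random.randint(len(q_values)))\n    return int(np.argmax(q_values))\n\n# Toy usage: estimated values for four actions in the current state.\nq_estimates = np.array([0.2, 1.5, -0.3, 0.7])\naction = epsilon_greedy(q_estimates, epsilon=0.1)\n<\/code><\/pre>
<p>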
Effective RL algorithms implement mechanisms, such as epsilon-greedy or softmax action selection, to maintain this balance dynamically.<\/p>\n<h3 data-start=\"2536\" data-end=\"2576\">Value-Based and Policy-Based Methods<\/h3>\n<p data-start=\"2578\" data-end=\"2706\">Reinforcement learning methods are broadly categorized into <strong data-start=\"2638\" data-end=\"2653\">value-based<\/strong>, <strong data-start=\"2655\" data-end=\"2671\">policy-based<\/strong>, and <strong data-start=\"2677\" data-end=\"2693\">actor-critic<\/strong> approaches.<\/p>\n<ul data-start=\"2708\" data-end=\"4041\">\n<li data-start=\"2708\" data-end=\"3247\"><strong data-start=\"2710\" data-end=\"2733\">Value-based methods<\/strong> focus on estimating the value of states or state-action pairs. The <strong data-start=\"2801\" data-end=\"2819\">value function<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">V(s)V(s)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">V<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> represents the expected return starting from state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span>, while the <strong data-start=\"2897\" data-end=\"2922\">action-value function<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">Q(s,a)Q(s, a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> represents the expected return for taking action <span class=\"katex\"><span class=\"katex-mathml\">aa<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span> in state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span>. One of the most popular algorithms in this category is <strong data-start=\"3061\" data-end=\"3075\">Q-learning<\/strong>, which iteratively updates estimates of <span class=\"katex\"><span class=\"katex-mathml\">QQ<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><\/span><\/span><\/span>-values using the Bellman equation. Value-based methods are particularly useful when the state and action spaces are discrete.<\/li>\n<li data-start=\"3249\" data-end=\"3630\"><strong data-start=\"3251\" data-end=\"3275\">Policy-based methods<\/strong> directly parameterize and optimize the policy without explicitly estimating value functions. These methods are advantageous in environments with continuous action spaces and can learn stochastic policies, which are beneficial for exploration. <strong data-start=\"3519\" data-end=\"3538\">Policy Gradient<\/strong> algorithms, such as REINFORCE, use gradient ascent to maximize expected cumulative rewards.<\/li>\n<li data-start=\"3632\" data-end=\"4041\"><strong data-start=\"3634\" data-end=\"3658\">Actor-Critic methods<\/strong> combine the strengths of value-based and policy-based approaches. 
The <strong data-start=\"3729\" data-end=\"3738\">actor<\/strong> updates the policy directly, while the <strong data-start=\"3778\" data-end=\"3788\">critic<\/strong> estimates the value function, guiding the actor to improve its actions. Techniques such as <strong data-start=\"3880\" data-end=\"3912\">Advantage Actor-Critic (A2C)<\/strong> and <strong data-start=\"3917\" data-end=\"3962\">Deep Deterministic Policy Gradient (DDPG)<\/strong> leverage this hybrid framework, achieving remarkable success in complex tasks.<\/li>\n<\/ul>\n<h3 data-start=\"4043\" data-end=\"4074\">Deep Reinforcement Learning<\/h3>\n<p data-start=\"4076\" data-end=\"4646\">The integration of <strong data-start=\"4095\" data-end=\"4112\">deep learning<\/strong> with RL, known as <strong data-start=\"4131\" data-end=\"4172\">Deep Reinforcement Learning (Deep RL)<\/strong>, has dramatically expanded the capability of RL agents. Deep neural networks serve as function approximators for value functions, policies, or both, allowing RL agents to handle high-dimensional state spaces such as images or sensor data. Landmark successes, such as <strong data-start=\"4440\" data-end=\"4465\">Deep Q-Networks (DQN)<\/strong> achieving human-level performance in Atari games and <strong data-start=\"4519\" data-end=\"4530\">AlphaGo<\/strong> defeating professional Go players, demonstrate the potential of Deep RL in solving previously intractable problems.<\/p>\n<p data-start=\"4648\" data-end=\"5022\">Despite its promise, Deep RL presents challenges. Training is often data-intensive and unstable due to non-stationary targets and correlated data. Techniques like experience replay, target networks, and reward normalization are used to stabilize learning. Moreover, designing appropriate reward functions and managing sparse or delayed rewards remain ongoing research areas.<\/p>\n<h3 data-start=\"5024\" data-end=\"5066\">Applications of Reinforcement Learning<\/h3>\n<p data-start=\"5068\" data-end=\"5749\">Reinforcement learning has wide-ranging applications across industries. In <strong data-start=\"5143\" data-end=\"5155\">robotics<\/strong>, RL enables autonomous robots to learn complex manipulation tasks, locomotion, and navigation. In <strong data-start=\"5254\" data-end=\"5277\">autonomous vehicles<\/strong>, RL is used for decision-making, lane changing, and path planning. In the <strong data-start=\"5352\" data-end=\"5372\">financial sector<\/strong>, RL models optimize trading strategies and portfolio management. Gaming and entertainment also benefit from RL, with agents mastering video games and simulating intelligent non-player characters. Additionally, RL finds applications in healthcare for treatment planning and personalized interventions, as well as in operations research for inventory and logistics optimization.<\/p>\n<h2 data-start=\"189\" data-end=\"239\">Historical Background of Reinforcement Learning<\/h2>\n<p data-start=\"241\" data-end=\"810\">Reinforcement Learning (RL) is a subfield of machine learning concerned with how agents should take actions in an environment to maximize cumulative rewards. Unlike supervised learning, which relies on labeled input-output pairs, RL involves learning from interaction, trial-and-error, and delayed feedback. The historical development of reinforcement learning spans several decades, influenced by psychology, neuroscience, operations research, and computer science. 
Understanding this background is essential to appreciate the principles and modern advancements in RL.<\/p>\n<h3 data-start=\"812\" data-end=\"853\">Origins in Psychology and Behaviorism<\/h3>\n<p data-start=\"855\" data-end=\"1441\">The conceptual roots of reinforcement learning can be traced back to early psychological research on learning and behavior. Behaviorists, particularly <strong data-start=\"1006\" data-end=\"1022\">B.F. Skinner<\/strong> in the 1930s and 1940s, formalized the idea of operant conditioning. Skinner\u2019s experiments with animals, especially rats and pigeons, demonstrated that behavior could be shaped by reinforcement or punishment. The fundamental principle was that actions followed by positive outcomes (reinforcements) were more likely to be repeated, while actions followed by negative outcomes (punishments) were less likely to occur.<\/p>\n<p data-start=\"1443\" data-end=\"2033\">These behavioral experiments introduced the essential ideas of <strong data-start=\"1506\" data-end=\"1534\">trial-and-error learning<\/strong>, <strong data-start=\"1536\" data-end=\"1561\">reward-based feedback<\/strong>, and <strong data-start=\"1567\" data-end=\"1587\">action selection<\/strong>, which later became central to reinforcement learning algorithms. Psychologists such as <strong data-start=\"1676\" data-end=\"1696\">Edward Thorndike<\/strong>, with his law of effect, further influenced this conceptual foundation, emphasizing that behaviors leading to satisfying outcomes are more likely to be strengthened. Although these studies were biological and experimental in nature, they provided an abstract framework that computational scientists would later formalize mathematically.<\/p>\n<h3 data-start=\"2035\" data-end=\"2089\">Early Computational Models and Dynamic Programming<\/h3>\n<p data-start=\"2091\" data-end=\"2801\">The transition from psychology to computational modeling began in the 1950s and 1960s. Researchers in artificial intelligence (AI) and operations research explored methods for sequential decision-making under uncertainty. Early work by <strong data-start=\"2327\" data-end=\"2346\">Richard Bellman<\/strong> in the 1950s introduced <strong data-start=\"2371\" data-end=\"2394\">dynamic programming<\/strong>, a method for solving optimization problems by breaking them into smaller subproblems. Bellman formulated the <strong data-start=\"2505\" data-end=\"2525\">Bellman equation<\/strong>, a recursive representation of value functions, which later became a cornerstone of reinforcement learning theory. Dynamic programming required complete knowledge of the environment&#8217;s model and was initially used to solve problems in control theory and operations research.<\/p>\n<p data-start=\"2803\" data-end=\"3222\">Around the same time, <strong data-start=\"2825\" data-end=\"2842\">Marvin Minsky<\/strong> and other AI pioneers began exploring learning machines. The idea of creating algorithms that could adapt their behavior based on experience, rather than being hard-coded, was gaining momentum. 
Early AI research focused on symbolic reasoning, planning, and game playing, but there was a growing interest in learning from interaction with an environment, setting the stage for RL.<\/p>\n<h3 data-start=\"3224\" data-end=\"3282\">Temporal-Difference Learning and the 1980s Renaissance<\/h3>\n<p data-start=\"3284\" data-end=\"3741\">A major turning point in RL history occurred in the 1980s with the introduction of <strong data-start=\"3367\" data-end=\"3404\">Temporal-Difference (TD) learning<\/strong>. Richard Sutton, in 1988, formalized TD learning as a method to learn predictions and value functions directly from experience, without requiring a complete model of the environment. TD learning combines ideas from dynamic programming and Monte Carlo methods, enabling agents to update estimates based on partially observed sequences.<\/p>\n<p data-start=\"3743\" data-end=\"4099\">Sutton\u2019s TD(\u03bb) algorithm became a foundational method in reinforcement learning, bridging the gap between theoretical models and practical algorithms. Around the same time, researchers explored <strong data-start=\"3937\" data-end=\"3967\">actor-critic architectures<\/strong>, which separate policy (action selection) from value function evaluation, further enhancing the ability to learn complex behaviors.<\/p>\n<p data-start=\"4101\" data-end=\"4526\">Parallel research in neuroscience suggested that the brain might implement a form of reinforcement learning. Studies of the <strong data-start=\"4225\" data-end=\"4259\">dopaminergic system in mammals<\/strong> indicated that neural signals resembling temporal-difference errors could explain reward-based learning in animals. This biological insight strengthened the theoretical plausibility of reinforcement learning algorithms and inspired biologically motivated approaches.<\/p>\n<h3 data-start=\"4528\" data-end=\"4583\">Integration with Neural Networks: Early Experiments<\/h3>\n<p data-start=\"4585\" data-end=\"4984\">In the late 1980s and 1990s, researchers attempted to combine reinforcement learning with <strong data-start=\"4675\" data-end=\"4705\">artificial neural networks<\/strong> to address the challenge of learning in high-dimensional or continuous state spaces. Early successes included <strong data-start=\"4816\" data-end=\"4853\">Samuel\u2019s checkers-playing program<\/strong>, which used a form of temporal-difference learning with a linear function approximator to improve performance through self-play.<\/p>\n<p data-start=\"4986\" data-end=\"5504\">Neural networks enabled generalization across states, but early methods suffered from instability and slow convergence. Researchers such as <strong data-start=\"5126\" data-end=\"5137\">Tesauro<\/strong>, with his <strong data-start=\"5148\" data-end=\"5161\">TD-Gammon<\/strong> program in the early 1990s, demonstrated the power of combining reinforcement learning with neural networks. TD-Gammon learned to play backgammon at a world-class level, solely through self-play and temporal-difference learning. This achievement marked a milestone, showing that RL could handle complex tasks without explicit human knowledge.<\/p>\n<h3 data-start=\"5506\" data-end=\"5541\">Q-Learning and Policy Iteration<\/h3>\n<p data-start=\"5543\" data-end=\"5997\">Another key development in the evolution of reinforcement learning was <strong data-start=\"5614\" data-end=\"5628\">Q-learning<\/strong>, introduced by <strong data-start=\"5644\" data-end=\"5661\">Chris Watkins<\/strong> in 1989. 
Q-learning is a model-free RL algorithm that allows an agent to learn an optimal policy directly from experience, without requiring knowledge of the environment&#8217;s dynamics. The algorithm maintains a Q-value table, representing the expected cumulative reward for each state-action pair, and updates it using observed rewards.<\/p>\n<p data-start=\"5999\" data-end=\"6478\">Q-learning and related <strong data-start=\"6022\" data-end=\"6042\">policy iteration<\/strong> and <strong data-start=\"6047\" data-end=\"6066\">value iteration<\/strong> methods provided robust mathematical foundations for RL and became central to modern RL research. They enabled applications ranging from robotics to control systems, even in stochastic and uncertain environments. These methods highlighted the importance of balancing <strong data-start=\"6334\" data-end=\"6349\">exploration<\/strong> (trying new actions) and <strong data-start=\"6375\" data-end=\"6391\">exploitation<\/strong> (choosing the best-known actions) \u2014 a fundamental trade-off in reinforcement learning.<\/p>\n<h3 data-start=\"6480\" data-end=\"6531\">Modern Advances and Deep Reinforcement Learning<\/h3>\n<p data-start=\"6533\" data-end=\"6968\">While classical reinforcement learning techniques flourished in the 1990s and early 2000s, scaling them to high-dimensional tasks remained challenging. The breakthrough came with the integration of <strong data-start=\"6731\" data-end=\"6748\">deep learning<\/strong> and reinforcement learning in the 2010s. Deep RL uses deep neural networks as function approximators for value functions or policies, enabling agents to learn directly from raw sensory inputs, such as images or audio.<\/p>\n<p data-start=\"6970\" data-end=\"7492\">A landmark achievement was <strong data-start=\"6997\" data-end=\"7032\">DeepMind\u2019s Deep Q-Network (DQN)<\/strong>, introduced in 2015, which learned to play Atari games at superhuman levels using raw pixel inputs and Q-learning. This success demonstrated that reinforcement learning could now tackle complex, high-dimensional tasks in dynamic environments. Subsequent advancements in <strong data-start=\"7303\" data-end=\"7330\">policy gradient methods<\/strong>, <strong data-start=\"7332\" data-end=\"7359\">actor-critic algorithms<\/strong>, and <strong data-start=\"7365\" data-end=\"7383\">model-based RL<\/strong> further expanded the scope of applications, including robotics, autonomous driving, healthcare, and finance.<\/p>\n<h2 data-start=\"125\" data-end=\"163\">Evolution of Reinforcement Learning<\/h2>\n<p data-start=\"165\" data-end=\"676\">Reinforcement Learning (RL) is one of the most influential paradigms in artificial intelligence, enabling agents to learn optimal behavior through interaction with an environment. Its evolution spans multiple decades, combining insights from psychology, neuroscience, operations research, and computer science. The development of RL can be traced through several key phases: from foundational psychological theories to modern deep reinforcement learning approaches that drive cutting-edge AI applications today.<\/p>\n<h3 data-start=\"678\" data-end=\"724\">Early Foundations in Behavioral Psychology<\/h3>\n<p data-start=\"726\" data-end=\"1316\">The origins of reinforcement learning are deeply rooted in behavioral psychology. In the 1920s and 1930s, researchers like <strong data-start=\"849\" data-end=\"869\">Edward Thorndike<\/strong> and <strong data-start=\"874\" data-end=\"890\">B.F. 
Skinner<\/strong> pioneered the study of learning through consequences. Thorndike\u2019s <strong data-start=\"957\" data-end=\"974\">Law of Effect<\/strong> posited that behaviors followed by satisfying outcomes are likely to be repeated, while behaviors followed by negative outcomes are less likely to occur. Skinner expanded on this with his work in <strong data-start=\"1171\" data-end=\"1195\">operant conditioning<\/strong>, demonstrating how rewards and punishments could shape behavior in animals through systematic reinforcement schedules.<\/p>\n<p data-start=\"1318\" data-end=\"1693\">These early behavioral studies introduced fundamental concepts that remain central to RL: learning from interaction, delayed feedback, and trial-and-error exploration. The psychological framework provided the conceptual basis for later computational models, suggesting that intelligent behavior could emerge from iterative learning processes rather than explicit instruction.<\/p>\n<h3 data-start=\"1695\" data-end=\"1751\">Formalization in Mathematics and Operations Research<\/h3>\n<p data-start=\"1753\" data-end=\"2184\">By the 1950s and 1960s, researchers sought to formalize these behavioral concepts using mathematical models. <strong data-start=\"1862\" data-end=\"1881\">Richard Bellman<\/strong> introduced <strong data-start=\"1893\" data-end=\"1921\">dynamic programming (DP)<\/strong>, a method to solve sequential decision-making problems through recursive decomposition. The <strong data-start=\"2014\" data-end=\"2034\">Bellman equation<\/strong> allowed for the computation of optimal value functions and policies, providing a rigorous mathematical foundation for reinforcement learning.<\/p>\n<p data-start=\"2186\" data-end=\"2667\">At the same time, operations research and control theory explored stochastic optimization, Markov decision processes (MDPs), and sequential decision-making under uncertainty. The MDP framework, named for the Markov property studied by <strong data-start=\"2394\" data-end=\"2411\">Andrey Markov<\/strong> and formalized in the 1950s, provided a structure in which an agent interacts with an environment described by states, actions, transition probabilities, and rewards. Early research primarily relied on <strong data-start=\"2586\" data-end=\"2609\">model-based methods<\/strong>, where complete knowledge of the environment was assumed.<\/p>\n<h3 data-start=\"2669\" data-end=\"2728\">Temporal-Difference Learning and Early Computational RL<\/h3>\n<p data-start=\"2730\" data-end=\"3126\">The 1980s marked a major milestone in RL with the introduction of <strong data-start=\"2796\" data-end=\"2833\">temporal-difference (TD) learning<\/strong>, a method that allowed agents to learn directly from raw experience without requiring a full model of the environment. Proposed by <strong data-start=\"2965\" data-end=\"2983\">Richard Sutton<\/strong> in 1988, TD learning enabled updates to value functions incrementally using the difference between predicted and observed rewards over time.<\/p>\n<p data-start=\"3128\" data-end=\"3607\">TD learning bridged the gap between <strong data-start=\"3164\" data-end=\"3187\">Monte Carlo methods<\/strong>, which rely on complete episodic experience, and <strong data-start=\"3237\" data-end=\"3260\">dynamic programming<\/strong>, which requires a perfect model. It introduced a mechanism to learn from partially observed sequences and became foundational in modern RL. 
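The core update can be written in a couple of lines; the sketch below shows a TD(0) update for state values, with the states, reward, step size, and discount factor chosen purely for illustration.<\/p>
<pre><code># TD(0): nudge the value of the current state toward the\n# one-step target r + gamma * V(next_state).\ndef td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):\n    target = reward + gamma * V[next_state]\n    V[state] += alpha * (target - V[state])\n    return V\n\n# Illustrative values: two states and one observed transition with reward 1.\nV = {'s0': 0.0, 's1': 0.0}\nV = td0_update(V, 's0', reward=1.0, next_state='s1')\n<\/code><\/pre>
<p>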
During this period, the <strong data-start=\"3425\" data-end=\"3454\">actor-critic architecture<\/strong> was also explored, allowing for separate learning of policy (actor) and value functions (critic), providing more flexible and scalable approaches to RL.<\/p>\n<h3 data-start=\"3609\" data-end=\"3668\">Model-Free Reinforcement Learning: Q-Learning and SARSA<\/h3>\n<p data-start=\"3670\" data-end=\"4162\">The late 1980s and early 1990s saw the emergence of <strong data-start=\"3722\" data-end=\"3750\">model-free RL algorithms<\/strong>, which did not require explicit knowledge of the environment&#8217;s dynamics. <strong data-start=\"3824\" data-end=\"3838\">Q-learning<\/strong>, introduced by <strong data-start=\"3854\" data-end=\"3871\">Chris Watkins<\/strong> in 1989, became a landmark algorithm. It allowed agents to learn an optimal action-value function by iteratively updating Q-values based on observed rewards and future value estimates. Q-learning is both <strong data-start=\"4076\" data-end=\"4090\">off-policy<\/strong> and robust, making it widely applicable to a variety of environments.<\/p>\n<p data-start=\"4164\" data-end=\"4526\">Simultaneously, the <strong data-start=\"4184\" data-end=\"4203\">SARSA algorithm<\/strong> (State-Action-Reward-State-Action) was introduced, an <strong data-start=\"4258\" data-end=\"4271\">on-policy<\/strong> method that updated action-value estimates using the agent\u2019s actual behavior rather than an optimal policy. These algorithms emphasized the exploration-exploitation trade-off, ensuring agents balance trying new actions with leveraging existing knowledge.<\/p>\n<h3 data-start=\"4528\" data-end=\"4588\">Integration with Neural Networks: Function Approximation<\/h3>\n<p data-start=\"4590\" data-end=\"4993\">As reinforcement learning algorithms matured, researchers faced the challenge of scaling RL to high-dimensional state and action spaces. Early methods relied on tabular representations, which were infeasible for complex environments. During the late 1980s and 1990s, researchers began integrating RL with <strong data-start=\"4895\" data-end=\"4925\">artificial neural networks<\/strong>, enabling function approximation for value functions or policies.<\/p>\n<p data-start=\"4995\" data-end=\"5464\">One of the earliest successes was <strong data-start=\"5029\" data-end=\"5058\">Samuel\u2019s checkers program<\/strong>, which used temporal-difference learning with a linear function approximator to achieve strong performance. Another milestone was <strong data-start=\"5189\" data-end=\"5212\">Tesauro\u2019s TD-Gammon<\/strong>, which combined TD learning with a neural network to play backgammon at a world-class level. These efforts demonstrated that RL could handle complex, continuous state spaces and achieve high-level performance without explicit human-crafted heuristics.<\/p>\n<h3 data-start=\"5466\" data-end=\"5521\">Advances in Exploration and Policy Gradient Methods<\/h3>\n<p data-start=\"5523\" data-end=\"5990\">Traditional value-based methods like Q-learning focused on estimating the expected cumulative reward, but handling large or continuous action spaces remained challenging. In the 1990s and early 2000s, <strong data-start=\"5724\" data-end=\"5748\">policy-based methods<\/strong> gained attention. Policy gradient algorithms directly optimize the agent\u2019s policy through gradient ascent on expected returns. 
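In its simplest REINFORCE-style form, the log-probability of each action taken is pushed up in proportion to the return that followed it; the sketch below illustrates one gradient-ascent step for a stateless softmax policy (the episode data, learning rate, and three-action setup are assumptions made for the example).<\/p>
<pre><code>import numpy as np\n\ndef softmax(logits):\n    z = np.exp(logits - logits.max())\n    return z / z.sum()\n\ndef reinforce_update(theta, actions, returns, alpha=0.01):\n    # REINFORCE for a stateless softmax policy: increase the log-probability\n    # of each action taken, weighted by the return G observed after it.\n    for a, G in zip(actions, returns):\n        probs = softmax(theta)\n        grad_log_pi = -probs\n        grad_log_pi[a] += 1.0  # gradient of log softmax(theta)[a] w.r.t. theta\n        theta = theta + alpha * G * grad_log_pi\n    return theta\n\n# Illustrative episode: actions taken and the returns observed after them.\ntheta = np.zeros(3)  # one logit per action\ntheta = reinforce_update(theta, actions=[0, 2, 0], returns=[1.0, 0.5, 1.0])\n<\/code><\/pre>
<p>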
This approach enables smooth, continuous action selection and addresses some limitations of value-based methods.<\/p>\n<p data-start=\"5992\" data-end=\"6318\">The combination of actor-critic architectures with policy gradients allowed for more efficient learning in dynamic and partially observable environments. Techniques like <strong data-start=\"6162\" data-end=\"6184\">eligibility traces<\/strong> further improved credit assignment, ensuring that rewards received at a later stage could influence earlier actions more effectively.<\/p>\n<h3 data-start=\"6320\" data-end=\"6355\">Deep Reinforcement Learning Era<\/h3>\n<p data-start=\"6357\" data-end=\"6638\">The integration of <strong data-start=\"6376\" data-end=\"6393\">deep learning<\/strong> with reinforcement learning in the 2010s marked a revolutionary phase. Deep RL allows agents to learn directly from high-dimensional sensory inputs, such as images, audio, or text, by leveraging deep neural networks as function approximators.<\/p>\n<p data-start=\"6640\" data-end=\"7204\"><strong data-start=\"6640\" data-end=\"6665\">Deep Q-Networks (DQN)<\/strong>, developed by <strong data-start=\"6680\" data-end=\"6692\">DeepMind<\/strong> in 2015, achieved human-level performance on a variety of Atari games using raw pixel inputs. The success of DQN demonstrated that RL could scale to complex, high-dimensional problems without hand-engineered features. Following this, <strong data-start=\"6927\" data-end=\"6954\">policy gradient methods<\/strong>, <strong data-start=\"6956\" data-end=\"6981\">actor-critic networks<\/strong>, and <strong data-start=\"6987\" data-end=\"7030\">trust region policy optimization (TRPO)<\/strong> further enhanced stability and performance. Techniques like <strong data-start=\"7091\" data-end=\"7112\">experience replay<\/strong> and <strong data-start=\"7117\" data-end=\"7136\">target networks<\/strong> were critical innovations to stabilize learning in deep RL systems.<\/p>\n<h3 data-start=\"7206\" data-end=\"7275\">Modern Trends: Model-Based and Multi-Agent Reinforcement Learning<\/h3>\n<p data-start=\"7277\" data-end=\"7603\">Recent years have seen a diversification in reinforcement learning methodologies. <strong data-start=\"7359\" data-end=\"7377\">Model-based RL<\/strong> aims to learn a model of the environment, allowing for planning and improved sample efficiency. Combining model-based and model-free methods has led to significant gains in robotics, autonomous systems, and strategic games.<\/p>\n<p data-start=\"7605\" data-end=\"8109\"><strong data-start=\"7605\" data-end=\"7650\">Multi-agent reinforcement learning (MARL)<\/strong> is another emerging area, where multiple agents learn simultaneously, often in competitive or cooperative settings. This line of research extends RL to real-world scenarios involving social interaction, negotiation, and emergent behavior. Additionally, methods incorporating <strong data-start=\"7926\" data-end=\"7943\">meta-learning<\/strong>, <strong data-start=\"7945\" data-end=\"7968\">curriculum learning<\/strong>, and <strong data-start=\"7974\" data-end=\"8013\">hierarchical reinforcement learning<\/strong> allow agents to transfer knowledge across tasks, learn faster, and solve more complex problems.<\/p>\n<h3 data-start=\"8111\" data-end=\"8145\">Applications Driving Evolution<\/h3>\n<p data-start=\"8147\" data-end=\"8729\">The evolution of reinforcement learning has been fueled by practical applications across domains. 
Early RL was tested in games such as checkers and backgammon. Later, applications expanded to robotics, autonomous vehicles, resource management, finance, and healthcare. Reinforcement learning\u2019s capacity to optimize sequential decision-making under uncertainty has made it an indispensable tool in modern AI. In particular, successes in <strong data-start=\"8583\" data-end=\"8594\">AlphaGo<\/strong>, <strong data-start=\"8596\" data-end=\"8620\">robotic manipulation<\/strong>, and <strong data-start=\"8626\" data-end=\"8656\">autonomous control systems<\/strong> showcase the power of RL in both simulation and real-world environments.<\/p>\n<h2 data-start=\"141\" data-end=\"199\">Core Concepts and Terminology of Reinforcement Learning<\/h2>\n<p data-start=\"201\" data-end=\"763\">Reinforcement Learning (RL) is a branch of machine learning focused on how agents learn optimal behavior through interaction with an environment to maximize cumulative reward. Unlike supervised learning, where labeled datasets guide learning, RL relies on trial-and-error and feedback from the environment. Understanding the core concepts and terminology is crucial for grasping the mechanisms, algorithms, and applications of RL. This discussion elaborates on the foundational terms, frameworks, and mathematical formulations that define reinforcement learning.<\/p>\n<h3 data-start=\"765\" data-end=\"804\">Agent, Environment, and Interaction<\/h3>\n<p data-start=\"806\" data-end=\"1149\">At the heart of reinforcement learning lies the <strong data-start=\"854\" data-end=\"885\">agent-environment interface<\/strong>. The <strong data-start=\"891\" data-end=\"900\">agent<\/strong> is the learner or decision-maker, responsible for selecting actions to achieve a goal. The <strong data-start=\"992\" data-end=\"1007\">environment<\/strong> encompasses everything external to the agent, including states, rules, and dynamics that determine the consequences of the agent\u2019s actions.<\/p>\n<p data-start=\"1151\" data-end=\"1271\">Interaction between the agent and environment occurs in discrete time steps <span class=\"katex\"><span class=\"katex-mathml\">t=0,1,2,\u2026t = 0, 1, 2, \\dots<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">t<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord\">0<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\">1<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\">2<\/span><span class=\"mpunct\">,<\/span><span class=\"minner\">\u2026<\/span><\/span><\/span><\/span>. 
At each time step:<\/p>\n<ol data-start=\"1273\" data-end=\"1517\">\n<li data-start=\"1273\" data-end=\"1346\">The agent observes the current <strong data-start=\"1307\" data-end=\"1316\">state<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">sts_t<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> of the environment.<\/li>\n<li data-start=\"1347\" data-end=\"1426\">The agent selects an <strong data-start=\"1371\" data-end=\"1381\">action<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">ata_t<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> based on its <strong data-start=\"1405\" data-end=\"1415\">policy<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"1427\" data-end=\"1517\">The environment responds with a <strong data-start=\"1462\" data-end=\"1472\">reward<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">rt+1r_{t+1}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> and a new state <span class=\"katex\"><span class=\"katex-mathml\">st+1s_{t+1}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>.<\/li>\n<\/ol>\n<p data-start=\"1519\" data-end=\"1688\">This iterative cycle forms the basis for learning. 
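In code, the interaction is usually written as a loop; the sketch below uses a throwaway two-state environment whose reset and step methods mimic the common Gym-style convention (the environment, its dynamics, and the random placeholder policy are all assumptions made for the example).<\/p>
<pre><code>import random\n\nclass ToyEnv:\n    # A tiny made-up environment, included only so the loop is runnable.\n    def reset(self):\n        self.state = 0\n        return self.state\n\n    def step(self, action):\n        reward = 1.0 if action == self.state else 0.0\n        self.state = 1 - self.state\n        done = random.random() &lt; 0.1  # episodes end at random\n        return self.state, reward, done\n\nenv = ToyEnv()\nstate = env.reset()          # observe s_0\ntotal_reward, done = 0.0, False\nwhile not done:\n    action = random.choice([0, 1])           # placeholder policy pi(s_t)\n    state, reward, done = env.step(action)   # receive r_(t+1) and s_(t+1)\n    total_reward += reward\n<\/code><\/pre>
<p>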
The goal of the agent is to maximize cumulative reward over time, often considering both immediate and future rewards.<\/p>\n<h3 data-start=\"1690\" data-end=\"1719\">State, Action, and Reward<\/h3>\n<p data-start=\"1721\" data-end=\"1857\">The three core elements defining an RL problem are <strong data-start=\"1772\" data-end=\"1781\">state<\/strong>, <strong data-start=\"1783\" data-end=\"1793\">action<\/strong>, and <strong data-start=\"1799\" data-end=\"1809\">reward<\/strong>, commonly referred to as the <strong data-start=\"1839\" data-end=\"1846\">SAR<\/strong> framework.<\/p>\n<ul data-start=\"1859\" data-end=\"2395\">\n<li data-start=\"1859\" data-end=\"2034\"><strong data-start=\"1861\" data-end=\"1880\">State (<span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span>)<\/strong>: A representation of the environment at a given time. States can be discrete (e.g., board positions in chess) or continuous (e.g., robot joint angles).<\/li>\n<li data-start=\"2035\" data-end=\"2220\"><strong data-start=\"2037\" data-end=\"2057\">Action (<span class=\"katex\"><span class=\"katex-mathml\">aa<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span>)<\/strong>: The decision or move the agent can make in a given state. Actions can also be discrete (e.g., up\/down\/left\/right) or continuous (e.g., steering angle, torque).<\/li>\n<li data-start=\"2221\" data-end=\"2395\"><strong data-start=\"2223\" data-end=\"2243\">Reward (<span class=\"katex\"><span class=\"katex-mathml\">rr<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">r<\/span><\/span><\/span><\/span>)<\/strong>: A scalar feedback signal received from the environment after taking an action. Rewards guide the agent by signaling desirable or undesirable outcomes.<\/li>\n<\/ul>\n<p data-start=\"2397\" data-end=\"2621\">The choice of state representation is critical, as it determines the agent\u2019s ability to perceive the environment accurately. Poorly designed states can hinder learning, while informative states facilitate faster convergence.<\/p>\n<h3 data-start=\"2623\" data-end=\"2633\">Policy<\/h3>\n<p data-start=\"2635\" data-end=\"2751\">A <strong data-start=\"2637\" data-end=\"2647\">policy<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span> defines the agent\u2019s behavior by specifying a mapping from states to actions. 
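In code, this mapping can be as simple as a lookup table for a deterministic policy, or a per-state probability distribution for a stochastic one; the sketch below is purely illustrative (the state names, actions, and probabilities are invented for the example).<\/p>
<pre><code>import random\n\n# Deterministic policy: a fixed lookup table from state to action.\ndeterministic_pi = {'low_battery': 'recharge', 'full_battery': 'explore'}\n\n# Stochastic policy: a probability distribution over actions for each state.\nstochastic_pi = {\n    'low_battery':  {'recharge': 0.9, 'explore': 0.1},\n    'full_battery': {'recharge': 0.2, 'explore': 0.8},\n}\n\ndef sample_action(policy, state):\n    # Draw an action according to the distribution stored for this state.\n    actions = list(policy[state])\n    weights = list(policy[state].values())\n    return random.choices(actions, weights=weights, k=1)[0]\n\naction = deterministic_pi['low_battery']               # always 'recharge'\nsampled = sample_action(stochastic_pi, 'full_battery')\n<\/code><\/pre>
<p>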
Policies can be:<\/p>\n<ul data-start=\"2753\" data-end=\"2946\">\n<li data-start=\"2753\" data-end=\"2836\"><strong data-start=\"2755\" data-end=\"2772\">Deterministic<\/strong>: <span class=\"katex\"><span class=\"katex-mathml\">a=\u03c0(s)a = \\pi(s)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>, where the action is fixed for a given state.<\/li>\n<li data-start=\"2837\" data-end=\"2946\"><strong data-start=\"2839\" data-end=\"2853\">Stochastic<\/strong>: <span class=\"katex\"><span class=\"katex-mathml\">\u03c0(a\u2223s)=P(a\u2223s)\\pi(a|s) = P(a|s)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>, where actions are sampled according to a probability distribution.<\/li>\n<\/ul>\n<p data-start=\"2948\" data-end=\"3169\">Learning an optimal policy is the ultimate objective of reinforcement learning. The policy determines how the agent balances exploration (trying new actions) and exploitation (choosing actions known to yield high reward).<\/p>\n<h3 data-start=\"3171\" data-end=\"3189\">Value Function<\/h3>\n<p data-start=\"3191\" data-end=\"3356\">The <strong data-start=\"3195\" data-end=\"3213\">value function<\/strong> measures how good it is for an agent to be in a certain state, in terms of expected cumulative reward. 
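Concretely, under a fixed policy the value of a state can be estimated by averaging the discounted returns observed after visiting it; the first-visit Monte Carlo sketch below uses made-up episode data to show the computation.<\/p>
<pre><code>def discounted_return(rewards, gamma=0.9):\n    # G_t = r_(t+1) + gamma * r_(t+2) + gamma**2 * r_(t+3) + ...\n    return sum(gamma ** k * r for k, r in enumerate(rewards))\n\ndef mc_state_value(episodes, state, gamma=0.9):\n    # First-visit Monte Carlo: average the return observed from the\n    # first time the state appears in each episode that visits it.\n    returns = []\n    for states, rewards in episodes:\n        if state in states:\n            t = states.index(state)\n            returns.append(discounted_return(rewards[t:], gamma))\n    return sum(returns) / len(returns)\n\n# Made-up episodes: (states visited, reward received at each step).\nepisodes = [(['A', 'B', 'C'], [0.0, 0.0, 1.0]),\n            (['A', 'C'], [0.0, 1.0])]\nv_A = mc_state_value(episodes, 'A', gamma=0.9)  # roughly 0.86 here\n<\/code><\/pre>
<p>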
Value functions come in two main types:<\/p>\n<ol data-start=\"3358\" data-end=\"3859\">\n<li data-start=\"3358\" data-end=\"3584\">\n<p data-start=\"3361\" data-end=\"3477\"><strong data-start=\"3361\" data-end=\"3398\">State-value function (<span class=\"katex\"><span class=\"katex-mathml\">V(s)V(s)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">V<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>)<\/strong>: Expected return starting from state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span> and following policy <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span>.<\/p>\n<p><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">V\u03c0(s)=E\u03c0[\u2211t=0\u221e\u03b3trt+1\u2223s0=s]V^{\\pi}(s) = \\mathbb{E}_\\pi \\Big[ \\sum_{t=0}^{\\infty} \\gamma^t r_{t+1} \\mid s_0 = s \\Big]<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">V<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">\u03c0<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathbb\">E<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">\u03c0<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"delimsizing size2\">[<\/span><\/span><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mrel mtight\">=<\/span>0<\/span><\/span><span class=\"mop op-symbol large-op\">\u2211<\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mrel\">\u2223<\/span><\/span><span 
class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">0<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><span class=\"mord\"><span class=\"delimsizing size2\">]<\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<li data-start=\"3586\" data-end=\"3859\">\n<p data-start=\"3589\" data-end=\"3743\"><strong data-start=\"3589\" data-end=\"3629\">Action-value function (<span class=\"katex\"><span class=\"katex-mathml\">Q(s,a)Q(s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>)<\/strong>: Expected return starting from state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span>, taking action <span class=\"katex\"><span class=\"katex-mathml\">aa<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span>, and following policy <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span> thereafter.<\/p>\n<p><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">Q\u03c0(s,a)=E\u03c0[\u2211t=0\u221e\u03b3trt+1\u2223s0=s,a0=a]Q^{\\pi}(s,a) = \\mathbb{E}_\\pi \\Big[ \\sum_{t=0}^{\\infty} \\gamma^t r_{t+1} \\mid s_0 = s, a_0 = a \\Big]<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">Q<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">\u03c0<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathbb\">E<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">\u03c0<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"delimsizing size2\">[<\/span><\/span><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mrel mtight\">=<\/span>0<\/span><\/span><span class=\"mop op-symbol large-op\">\u2211<\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord 
mtight\">\u221e<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mrel\">\u2223<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">0<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">0<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><span class=\"mord\"><span class=\"delimsizing size2\">]<\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<\/ol>\n<p data-start=\"3861\" data-end=\"4072\">Value functions are critical because they provide a measure of long-term desirability of states or actions, enabling informed decision-making. Most RL algorithms aim to estimate these value functions accurately.<\/p>\n<h3 data-start=\"4074\" data-end=\"4102\">Reward Signal and Return<\/h3>\n<p data-start=\"4104\" data-end=\"4418\">The <strong data-start=\"4108\" data-end=\"4125\">reward signal<\/strong> is the only guidance the agent receives from the environment. It indicates whether an action leads to a favorable outcome. However, the reward at a single time step may not capture long-term consequences. 
Therefore, RL defines the <strong data-start=\"4357\" data-end=\"4367\">return<\/strong>, typically the discounted sum of future rewards:<\/p>\n<p data-start=\"8147\" data-end=\"8729\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">Gt=rt+1+\u03b3rt+2+\u03b32rt+3+\u22ef=\u2211k=0\u221e\u03b3krt+k+1G_t = r_{t+1} + \\gamma r_{t+2} + \\gamma^2 r_{t+3} + \\dots = \\sum_{k=0}^{\\infty} \\gamma^k r_{t+k+1}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">G<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>2<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>3<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"minner\">\u22ef<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">k<\/span><span class=\"mrel mtight\">=<\/span>0<\/span><\/span><span class=\"mop op-symbol large-op\">\u2211<\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal 
mtight\">k<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span><span class=\"mord mathnormal mtight\">k<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p data-start=\"4526\" data-end=\"4747\">Here, <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\u2208[0,1]\\gamma \\in [0,1]<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"mrel\">\u2208<\/span><\/span><span class=\"base\"><span class=\"mopen\">[<\/span><span class=\"mord\">0<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\">1<\/span><span class=\"mclose\">]<\/span><\/span><\/span><\/span> is the <strong data-start=\"4562\" data-end=\"4581\">discount factor<\/strong>, which balances the importance of immediate and future rewards. A high <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span> encourages long-term planning, while a low <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span> favors immediate rewards.<\/p>\n<h3 data-start=\"4749\" data-end=\"4777\">Model of the Environment<\/h3>\n<p data-start=\"4779\" data-end=\"4932\">A <strong data-start=\"4781\" data-end=\"4790\">model<\/strong> of the environment provides information about state transitions and rewards. RL approaches are categorized based on whether they use a model:<\/p>\n<ul data-start=\"4934\" data-end=\"5221\">\n<li data-start=\"4934\" data-end=\"5082\"><strong data-start=\"4936\" data-end=\"4954\">Model-based RL<\/strong>: The agent has or learns a model <span class=\"katex\"><span class=\"katex-mathml\">P(s\u2032,r\u2223s,a)P(s&#8217;,r|s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">r<\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> of the environment\u2019s dynamics. Planning can be performed using this model.<\/li>\n<li data-start=\"5083\" data-end=\"5221\"><strong data-start=\"5085\" data-end=\"5102\">Model-free RL<\/strong>: The agent learns value functions or policies directly from experience without a model. 
Q-learning is a prime example.<\/li>\n<\/ul>\n<p data-start=\"5223\" data-end=\"5371\">The distinction between model-based and model-free RL affects sample efficiency, computational complexity, and applicability to real-world problems.<\/p>\n<h3 data-start=\"5373\" data-end=\"5405\">Exploration vs. Exploitation<\/h3>\n<p data-start=\"5407\" data-end=\"5651\">A central challenge in RL is the <strong data-start=\"5440\" data-end=\"5478\">exploration-exploitation trade-off<\/strong>. Agents must explore unknown actions to discover potentially better strategies, but they must also exploit current knowledge to maximize rewards. Common strategies include:<\/p>\n<ul data-start=\"5653\" data-end=\"6002\">\n<li data-start=\"5653\" data-end=\"5768\"><strong data-start=\"5655\" data-end=\"5667\">\u03b5-greedy<\/strong>: With probability <span class=\"katex\"><span class=\"katex-mathml\">\u03f5\\epsilon<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03f5<\/span><\/span><\/span><\/span>, choose a random action; otherwise, choose the best-known action.<\/li>\n<li data-start=\"5769\" data-end=\"5875\"><strong data-start=\"5771\" data-end=\"5810\">Softmax\/Probability-based selection<\/strong>: Assign action probabilities proportional to expected rewards.<\/li>\n<li data-start=\"5876\" data-end=\"6002\"><strong data-start=\"5878\" data-end=\"5910\">Upper Confidence Bound (UCB)<\/strong>: Balance exploration and exploitation by considering both estimated reward and uncertainty.<\/li>\n<\/ul>\n<p data-start=\"6004\" data-end=\"6099\">Effective exploration is essential to avoid local optima and achieve globally optimal policies.<\/p>\n<h3 data-start=\"6101\" data-end=\"6134\">Markov Decision Process (MDP)<\/h3>\n<p data-start=\"6136\" data-end=\"6245\">Most RL problems are formalized as <strong data-start=\"6171\" data-end=\"6207\">Markov Decision Processes (MDPs)<\/strong>, a mathematical framework defined by:<\/p>\n<ul data-start=\"6247\" data-end=\"6468\">\n<li data-start=\"6247\" data-end=\"6273\"><span class=\"katex\"><span class=\"katex-mathml\">SS<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">S<\/span><\/span><\/span><\/span>: Set of states<\/li>\n<li data-start=\"6274\" data-end=\"6301\"><span class=\"katex\"><span class=\"katex-mathml\">AA<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">A<\/span><\/span><\/span><\/span>: Set of actions<\/li>\n<li data-start=\"6302\" data-end=\"6397\"><span class=\"katex\"><span class=\"katex-mathml\">P(s\u2032\u2223s,a)P(s&#8217;|s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>: Transition probability from state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord 
mathnormal\">s<\/span><\/span><\/span><\/span> to <span class=\"katex\"><span class=\"katex-mathml\">s\u2032s&#8217;<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> given action <span class=\"katex\"><span class=\"katex-mathml\">aa<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span><\/li>\n<li data-start=\"6398\" data-end=\"6434\"><span class=\"katex\"><span class=\"katex-mathml\">R(s,a,s\u2032)R(s,a,s&#8217;)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">R<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>: Reward function<\/li>\n<li data-start=\"6435\" data-end=\"6468\"><span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span>: Discount factor<\/li>\n<\/ul>\n<p data-start=\"6470\" data-end=\"6707\">MDPs assume the <strong data-start=\"6486\" data-end=\"6505\">Markov property<\/strong>, where the future depends only on the current state and action, not the history. MDPs provide the foundation for deriving optimal policies using dynamic programming, value iteration, and RL algorithms.<\/p>\n<h3 data-start=\"6709\" data-end=\"6730\">Bellman Equations<\/h3>\n<p data-start=\"6732\" data-end=\"6911\">The <strong data-start=\"6736\" data-end=\"6757\">Bellman equations<\/strong> define recursive relationships for value functions, forming the theoretical backbone of RL. 
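s<\/span>">
<\/p>
<p>To keep the recursion concrete, here is a minimal and purely illustrative sketch: a two-state toy MDP stored as plain Python dictionaries (every transition probability and reward below is invented), together with the expectation backup that the next equation states in general form.<\/p>
<pre><code># Toy MDP, purely illustrative: P[s][a] is a list of (probability, next_state, reward).
P = {
    's0': {'left':  [(1.0, 's0', 0.0)],
           'right': [(0.8, 's1', 1.0), (0.2, 's0', 0.0)]},
    's1': {'left':  [(1.0, 's0', 0.0)],
           'right': [(1.0, 's1', 2.0)]},
}
gamma = 0.9

def pi(a, s):
    # Equiprobable policy: both actions are chosen with probability 0.5 in every state.
    return 0.5

def bellman_backup(V, s):
    # One application of the Bellman expectation equation for state s.
    total = 0.0
    for a, outcomes in P[s].items():
        for prob, s_next, r in outcomes:
            total += pi(a, s) * prob * (r + gamma * V[s_next])
    return total

# Iterative policy evaluation: sweep the backup until the values settle.
V = {s: 0.0 for s in P}
for _ in range(100):
    V = {s: bellman_backup(V, s) for s in P}
print(V)
<\/code><\/pre>
<p>Repeating the backup in this way is iterative policy evaluation; replacing the expectation over actions with a maximum gives value iteration, which corresponds to the optimality equation shown afterwards.<\/p>
<p>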
For a policy <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span>, the state-value function satisfies:<\/p>\n<p data-start=\"8147\" data-end=\"8729\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">V\u03c0(s)=\u2211a\u03c0(a\u2223s)\u2211s\u2032P(s\u2032\u2223s,a)[R(s,a,s\u2032)+\u03b3V\u03c0(s\u2032)]V^\\pi(s) = \\sum_a \\pi(a|s) \\sum_{s&#8217;} P(s&#8217;|s,a) [ R(s,a,s&#8217;) + \\gamma V^\\pi(s&#8217;) ]<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">V<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">\u03c0<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">a<\/span><\/span><span class=\"mop op-symbol large-op\">\u2211<\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">\u03c0<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"sizing reset-size3 size1 mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><span class=\"mop op-symbol large-op\">\u2211<\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><span class=\"mopen\">[<\/span><span class=\"mord mathnormal\">R<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"mord\"><span class=\"mord mathnormal\">V<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span 
class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">\u03c0<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)]<\/span><\/span><\/span><\/span><\/span><\/p>\n<p data-start=\"7000\" data-end=\"7069\">Similarly, the optimal state-value function <span class=\"katex\"><span class=\"katex-mathml\">V\u2217(s)V^*(s)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">V<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> satisfies:<\/p>\n<p data-start=\"8147\" data-end=\"8729\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">V\u2217(s)=max\u2061a\u2211s\u2032P(s\u2032\u2223s,a)[R(s,a,s\u2032)+\u03b3V\u2217(s\u2032)]V^*(s) = \\max_a \\sum_{s&#8217;} P(s&#8217;|s,a) [ R(s,a,s&#8217;) + \\gamma V^*(s&#8217;) ]<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">V<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">a<\/span><\/span><span class=\"mop\">max<\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"sizing reset-size3 size1 mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><span class=\"mop op-symbol large-op\">\u2211<\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><span class=\"mopen\">[<\/span><span class=\"mord mathnormal\">R<\/span><span 
class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"mord\"><span class=\"mord mathnormal\">V<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)]<\/span><\/span><\/span><\/span><\/span><\/p>\n<p data-start=\"7146\" data-end=\"7287\">These equations enable algorithms like <strong data-start=\"7185\" data-end=\"7204\">value iteration<\/strong>, <strong data-start=\"7206\" data-end=\"7226\">policy iteration<\/strong>, and <strong data-start=\"7232\" data-end=\"7246\">Q-learning<\/strong> to compute optimal policies efficiently.<\/p>\n<h3 data-start=\"7289\" data-end=\"7323\">Policy and Value-Based Methods<\/h3>\n<p data-start=\"7325\" data-end=\"7396\">Reinforcement learning algorithms generally fall into three categories:<\/p>\n<ol data-start=\"7398\" data-end=\"7889\">\n<li data-start=\"7398\" data-end=\"7523\"><strong data-start=\"7401\" data-end=\"7424\">Value-based methods<\/strong>: Estimate value functions and derive policies indirectly. Examples include Q-learning and SARSA.<\/li>\n<li data-start=\"7524\" data-end=\"7692\"><strong data-start=\"7527\" data-end=\"7551\">Policy-based methods<\/strong>: Directly optimize the policy using gradient ascent on expected return. Examples include REINFORCE and actor-only policy gradient methods.<\/li>\n<li data-start=\"7693\" data-end=\"7889\"><strong data-start=\"7696\" data-end=\"7720\">Actor-Critic methods<\/strong>: Combine value and policy learning. The critic estimates value functions, and the actor updates the policy using feedback from the critic. Examples include A3C and PPO.<\/li>\n<\/ol>\n<p data-start=\"7891\" data-end=\"8013\">These methods address different challenges such as convergence stability, continuous action spaces, and sample efficiency.<\/p>\n<h3 data-start=\"8015\" data-end=\"8063\">Temporal-Difference and Monte Carlo Learning<\/h3>\n<p data-start=\"8065\" data-end=\"8192\">RL algorithms often rely on <strong data-start=\"8093\" data-end=\"8130\">temporal-difference (TD) learning<\/strong> and <strong data-start=\"8135\" data-end=\"8163\">Monte Carlo (MC) methods<\/strong> to estimate value functions:<\/p>\n<ul data-start=\"8194\" data-end=\"8506\">\n<li data-start=\"8194\" data-end=\"8328\"><strong data-start=\"8196\" data-end=\"8210\">MC methods<\/strong>: Compute value estimates from complete episodes by averaging returns. 
They are unbiased but require episodic tasks.<\/li>\n<li data-start=\"8329\" data-end=\"8506\"><strong data-start=\"8331\" data-end=\"8345\">TD methods<\/strong>: Update estimates incrementally using the difference between predicted and observed rewards, allowing learning from incomplete episodes and online adaptation.<\/li>\n<\/ul>\n<p data-start=\"8508\" data-end=\"8675\">TD learning is particularly important because it combines ideas from dynamic programming and trial-and-error learning, forming the basis for most modern RL algorithms.<\/p>\n<h3 data-start=\"8677\" data-end=\"8703\">Function Approximation<\/h3>\n<p data-start=\"8705\" data-end=\"9100\">Real-world environments often have large or continuous state spaces, making tabular methods infeasible. <strong data-start=\"8809\" data-end=\"8835\">Function approximation<\/strong> techniques, such as <strong data-start=\"8856\" data-end=\"8873\">linear models<\/strong>, <strong data-start=\"8875\" data-end=\"8894\">neural networks<\/strong>, or <strong data-start=\"8899\" data-end=\"8923\">kernel-based methods<\/strong>, are used to generalize value functions or policies across states. This enables RL agents to scale to high-dimensional problems like image-based environments or robotics tasks.<\/p>\n<h3 data-start=\"9102\" data-end=\"9129\">Key Terminology Summary<\/h3>\n<div class=\"TyagGW_tableContainer\">\n<div class=\"group TyagGW_tableWrapper flex flex-col-reverse w-fit\" tabindex=\"-1\">\n<table class=\"w-fit min-w-(--thread-content-width)\" data-start=\"9131\" data-end=\"10122\">\n<thead data-start=\"9131\" data-end=\"9152\">\n<tr data-start=\"9131\" data-end=\"9152\">\n<th class=\"\" data-start=\"9131\" data-end=\"9138\" data-col-size=\"sm\">Term<\/th>\n<th class=\"\" data-start=\"9138\" data-end=\"9152\" data-col-size=\"md\">Definition<\/th>\n<\/tr>\n<\/thead>\n<tbody data-start=\"9175\" data-end=\"10122\">\n<tr data-start=\"9175\" data-end=\"9235\">\n<td data-start=\"9175\" data-end=\"9183\" data-col-size=\"sm\">Agent<\/td>\n<td data-start=\"9183\" data-end=\"9235\" data-col-size=\"md\">The learner or decision-maker in the environment<\/td>\n<\/tr>\n<tr data-start=\"9236\" data-end=\"9316\">\n<td data-start=\"9236\" data-end=\"9250\" data-col-size=\"sm\">Environment<\/td>\n<td data-start=\"9250\" data-end=\"9316\" data-col-size=\"md\">Everything external to the agent, providing states and rewards<\/td>\n<\/tr>\n<tr data-start=\"9317\" data-end=\"9381\">\n<td data-start=\"9317\" data-end=\"9329\" data-col-size=\"sm\">State (s)<\/td>\n<td data-start=\"9329\" data-end=\"9381\" data-col-size=\"md\">Representation of the environment at a time step<\/td>\n<\/tr>\n<tr data-start=\"9382\" data-end=\"9434\">\n<td data-start=\"9382\" data-end=\"9395\" data-col-size=\"sm\">Action (a)<\/td>\n<td data-start=\"9395\" data-end=\"9434\" data-col-size=\"md\">Choice made by the agent in a state<\/td>\n<\/tr>\n<tr data-start=\"9435\" data-end=\"9506\">\n<td data-start=\"9435\" data-end=\"9448\" data-col-size=\"sm\">Reward (r)<\/td>\n<td data-start=\"9448\" data-end=\"9506\" data-col-size=\"md\">Scalar feedback signal indicating success of an action<\/td>\n<\/tr>\n<tr data-start=\"9507\" data-end=\"9590\">\n<td data-start=\"9507\" data-end=\"9520\" data-col-size=\"sm\">Policy (\u03c0)<\/td>\n<td data-start=\"9520\" data-end=\"9590\" data-col-size=\"md\">Mapping from states to actions, can be deterministic or stochastic<\/td>\n<\/tr>\n<tr data-start=\"9591\" data-end=\"9679\">\n<td data-start=\"9591\" data-end=\"9615\" data-col-size=\"sm\">Value 
Function (V, Q)<\/td>\n<td data-col-size=\"md\" data-start=\"9615\" data-end=\"9679\">Expected cumulative reward from a state or state-action pair<\/td>\n<\/tr>\n<tr data-start=\"9680\" data-end=\"9729\">\n<td data-start=\"9680\" data-end=\"9693\" data-col-size=\"sm\">Return (G)<\/td>\n<td data-start=\"9693\" data-end=\"9729\" data-col-size=\"md\">Discounted sum of future rewards<\/td>\n<\/tr>\n<tr data-start=\"9730\" data-end=\"9804\">\n<td data-start=\"9730\" data-end=\"9752\" data-col-size=\"sm\">Discount Factor (\u03b3)<\/td>\n<td data-start=\"9752\" data-end=\"9804\" data-col-size=\"md\">Parameter weighting future vs. immediate rewards<\/td>\n<\/tr>\n<tr data-start=\"9805\" data-end=\"9860\">\n<td data-start=\"9805\" data-end=\"9813\" data-col-size=\"sm\">Model<\/td>\n<td data-col-size=\"md\" data-start=\"9813\" data-end=\"9860\">Knowledge of environment dynamics (P and R)<\/td>\n<\/tr>\n<tr data-start=\"9861\" data-end=\"9919\">\n<td data-start=\"9861\" data-end=\"9875\" data-col-size=\"sm\">Exploration<\/td>\n<td data-start=\"9875\" data-end=\"9919\" data-col-size=\"md\">Trying new actions to gather information<\/td>\n<\/tr>\n<tr data-start=\"9920\" data-end=\"9988\">\n<td data-start=\"9920\" data-end=\"9935\" data-col-size=\"sm\">Exploitation<\/td>\n<td data-start=\"9935\" data-end=\"9988\" data-col-size=\"md\">Choosing the best-known action to maximize reward<\/td>\n<\/tr>\n<tr data-start=\"9989\" data-end=\"10057\">\n<td data-start=\"9989\" data-end=\"10015\" data-col-size=\"sm\">Markov Decision Process<\/td>\n<td data-start=\"10015\" data-end=\"10057\" data-col-size=\"md\">Mathematical framework (S, A, P, R, \u03b3)<\/td>\n<\/tr>\n<tr data-start=\"10058\" data-end=\"10122\">\n<td data-start=\"10058\" data-end=\"10077\" data-col-size=\"sm\">Bellman Equation<\/td>\n<td data-col-size=\"md\" data-start=\"10077\" data-end=\"10122\">Recursive formulation for value functions<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2 data-start=\"114\" data-end=\"155\">Key Features of Reinforcement Learning<\/h2>\n<p data-start=\"157\" data-end=\"693\">Reinforcement Learning (RL) is a unique branch of machine learning that emphasizes learning optimal behaviors through interactions with an environment. Unlike supervised learning, where explicit labeled data guides the learning process, RL depends on feedback from the environment in the form of rewards or penalties. Understanding the key features of reinforcement learning is essential to grasp what distinguishes it from other learning paradigms and why it is particularly suitable for dynamic decision-making and sequential tasks.<\/p>\n<h3 data-start=\"695\" data-end=\"730\">1. Learning Through Interaction<\/h3>\n<p data-start=\"732\" data-end=\"1022\">One of the most fundamental features of reinforcement learning is that the agent learns by interacting with the environment. The agent does not rely on a pre-defined dataset; instead, it observes the state of the environment, selects actions, and receives feedback in the form of rewards.<\/p>\n<p data-start=\"1024\" data-end=\"1414\">This trial-and-error approach is intrinsic to RL. By taking actions and experiencing their consequences, the agent gradually learns which behaviors yield the highest cumulative reward. This interactive learning allows RL systems to adapt to environments that are partially known or dynamic, making it ideal for real-world applications such as robotics, autonomous driving, and game playing.<\/p>\n<h3 data-start=\"1416\" data-end=\"1445\">2. 
Goal-Oriented Behavior<\/h3>\n<p data-start=\"1447\" data-end=\"1782\">Reinforcement learning is inherently <strong data-start=\"1484\" data-end=\"1501\">goal-oriented<\/strong>. The agent is not just learning patterns or correlations; it is learning to maximize a cumulative objective, usually referred to as the <strong data-start=\"1638\" data-end=\"1648\">return<\/strong>. The return is the sum of immediate and future rewards, often discounted by a factor <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span> to prioritize near-term outcomes.<\/p>\n<p data-start=\"1784\" data-end=\"2072\">This feature differentiates RL from supervised learning, where the goal is to minimize prediction error rather than achieve long-term objectives. In RL, the agent must consider the consequences of its actions over time, requiring a balance between short-term gains and long-term benefits.<\/p>\n<h3 data-start=\"2074\" data-end=\"2105\">3. Trial-and-Error Learning<\/h3>\n<p data-start=\"2107\" data-end=\"2414\"><strong data-start=\"2107\" data-end=\"2135\">Trial-and-error learning<\/strong> is a hallmark of reinforcement learning. Unlike supervised learning, where the correct output is provided, RL agents must discover effective behaviors themselves. The agent tries different actions in various states, observes the outcomes, and updates its strategy accordingly.<\/p>\n<p data-start=\"2416\" data-end=\"2723\">This process allows RL agents to explore unknown situations and adapt to novel environments. However, trial-and-error learning also introduces challenges, particularly in balancing <strong data-start=\"2597\" data-end=\"2612\">exploration<\/strong> (trying new actions to gain information) with <strong data-start=\"2659\" data-end=\"2675\">exploitation<\/strong> (choosing actions known to yield high rewards).<\/p>\n<h3 data-start=\"2725\" data-end=\"2768\">4. Delayed Reward and Credit Assignment<\/h3>\n<p data-start=\"2770\" data-end=\"3013\">A defining characteristic of RL is the <strong data-start=\"2809\" data-end=\"2835\">delayed reward problem<\/strong>. Often, the outcomes of actions are not immediately apparent. An action taken at an early stage might influence the environment significantly only after many subsequent steps.<\/p>\n<p data-start=\"3015\" data-end=\"3440\">This introduces the <strong data-start=\"3035\" data-end=\"3064\">credit assignment problem<\/strong>, where the agent must determine which actions contributed to eventual rewards or penalties. Temporal-difference learning and eligibility traces are two techniques designed to address this challenge, allowing agents to propagate reward information backward to earlier decisions. Handling delayed rewards effectively is crucial for achieving long-term success in complex tasks.<\/p>\n<h3 data-start=\"3442\" data-end=\"3477\">5. Policy-Based Decision Making<\/h3>\n<p data-start=\"3479\" data-end=\"3722\">Reinforcement learning is focused on learning <strong data-start=\"3525\" data-end=\"3537\">policies<\/strong>, which define how an agent behaves in different states. 
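<\/p>
<p>Both forms a policy can take are easy to sketch in code; the tiny example below (state and action names are purely illustrative) contrasts a deterministic lookup table with a stochastic policy that samples actions from a probability distribution.<\/p>
<pre><code>import random

# Hypothetical grid-world states and actions.
deterministic_policy = {'start': 'right', 'corridor': 'right', 'junction': 'up'}

stochastic_policy = {
    'start':    {'right': 0.9, 'up': 0.1},
    'corridor': {'right': 0.7, 'up': 0.3},
    'junction': {'right': 0.2, 'up': 0.8},
}

def act_deterministic(state):
    # Always the same action for a given state: a = pi(s).
    return deterministic_policy[state]

def act_stochastic(state):
    # Sample an action according to pi(a|s).
    dist = stochastic_policy[state]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

print(act_deterministic('start'))   # always 'right'
print(act_stochastic('start'))      # usually 'right', occasionally 'up'
<\/code><\/pre>
<p>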
A policy <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span> can be deterministic (specific actions for specific states) or stochastic (probabilistic action selection).<\/p>\n<p data-start=\"3724\" data-end=\"4084\">The emphasis on policies allows RL agents to make sequential decisions that are consistent and goal-directed. Unlike supervised learning models, which produce static predictions, RL policies continuously interact with the environment and evolve based on observed feedback. This dynamic decision-making capability is essential for adaptive, intelligent systems.<\/p>\n<h3 data-start=\"4086\" data-end=\"4118\">6. Value Function Estimation<\/h3>\n<p data-start=\"4120\" data-end=\"4414\">Another key feature of RL is <strong data-start=\"4149\" data-end=\"4178\">value function estimation<\/strong>, which measures the expected cumulative reward of a state or state-action pair. The value function provides the agent with a predictive framework to evaluate how good a particular state or action is in achieving long-term objectives.<\/p>\n<p data-start=\"4416\" data-end=\"4694\">Value functions, such as the state-value function <span class=\"katex\"><span class=\"katex-mathml\">V(s)V(s)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">V<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> or the action-value function <span class=\"katex\"><span class=\"katex-mathml\">Q(s,a)Q(s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>, form the basis of most RL algorithms. By estimating value, agents can make informed decisions even in unfamiliar states, improving learning efficiency and policy performance.<\/p>\n<h3 data-start=\"4696\" data-end=\"4741\">7. Exploration vs. Exploitation Trade-Off<\/h3>\n<p data-start=\"4743\" data-end=\"4973\">The <strong data-start=\"4747\" data-end=\"4785\">exploration-exploitation trade-off<\/strong> is central to reinforcement learning. Agents must explore new actions to gather information about the environment, while simultaneously exploiting known actions that yield high rewards.<\/p>\n<p data-start=\"4975\" data-end=\"5289\">This trade-off is essential because an agent that only exploits may miss better strategies, while an agent that only explores may fail to maximize reward. Strategies such as \u03b5-greedy selection, softmax action selection, and Upper Confidence Bound (UCB) methods are commonly used to manage this balance effectively.<\/p>\n<h3 data-start=\"5291\" data-end=\"5333\">8. 
Model-Based and Model-Free Learning<\/h3>\n<p data-start=\"5335\" data-end=\"5440\">Reinforcement learning can be categorized based on whether the agent uses a <strong data-start=\"5411\" data-end=\"5439\">model of the environment<\/strong>:<\/p>\n<ul data-start=\"5442\" data-end=\"5779\">\n<li data-start=\"5442\" data-end=\"5617\"><strong data-start=\"5444\" data-end=\"5462\">Model-based RL<\/strong>: The agent learns or has access to a model <span class=\"katex\"><span class=\"katex-mathml\">P(s\u2032,r\u2223s,a)P(s&#8217;,r|s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">r<\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> of how the environment behaves. It can plan and simulate future actions before taking them.<\/li>\n<li data-start=\"5618\" data-end=\"5779\"><strong data-start=\"5620\" data-end=\"5637\">Model-free RL<\/strong>: The agent learns policies or value functions directly from experience, without an explicit model. Q-learning and SARSA are popular examples.<\/li>\n<\/ul>\n<p data-start=\"5781\" data-end=\"5962\">This distinction highlights the flexibility of RL: agents can learn in environments where modeling is feasible or rely solely on experience when modeling is difficult or impossible.<\/p>\n<h3 data-start=\"5964\" data-end=\"6010\">9. Adaptivity and Learning from Experience<\/h3>\n<p data-start=\"6012\" data-end=\"6264\">A defining feature of RL is its <strong data-start=\"6044\" data-end=\"6058\">adaptivity<\/strong>. RL agents improve their performance over time by continuously learning from their experiences. This allows them to adjust to changing environments, update strategies, and handle non-stationary dynamics.<\/p>\n<p data-start=\"6266\" data-end=\"6533\">Unlike traditional optimization methods that assume a fixed problem structure, RL agents are dynamic learners. This adaptivity is particularly valuable in applications where environments evolve, such as financial markets, autonomous vehicles, and multi-agent systems.<\/p>\n<h3 data-start=\"6535\" data-end=\"6569\">10. Sequential Decision-Making<\/h3>\n<p data-start=\"6571\" data-end=\"6851\">Reinforcement learning is inherently concerned with <strong data-start=\"6623\" data-end=\"6653\">sequential decision-making<\/strong>, where the outcome of one action affects future states and available actions. The interdependence of actions over time distinguishes RL from single-step decision problems or classification tasks.<\/p>\n<p data-start=\"6853\" data-end=\"7095\">Sequential decision-making requires agents to plan ahead, evaluate consequences, and optimize over multiple steps. 
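<\/p>
<p>One widely used way to propagate reward information back across those steps is a temporal-difference update. The fragment below sketches the classic one-step Q-learning rule on a plain dictionary; the learning rate, discount factor, and the sampled transition are all hypothetical.<\/p>
<pre><code>from collections import defaultdict

Q = defaultdict(float)      # tabular estimates; unseen (state, action) pairs start at 0.0
alpha, gamma = 0.1, 0.9     # learning rate and discount factor (illustrative values)

def q_learning_update(s, a, r, s_next, actions):
    # Move Q(s, a) toward the TD target r + gamma * max over a_next of Q(s_next, a_next).
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# A single made-up transition: taking 'right' in 'corridor' paid off with reward 1.0.
q_learning_update('corridor', 'right', 1.0, 'junction', actions=['right', 'up'])
print(Q[('corridor', 'right')])   # 0.1
<\/code><\/pre>
<p>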
Techniques like dynamic programming, temporal-difference learning, and policy gradients are designed to handle this complexity.<\/p>\n<\/div>\n<h2 data-start=\"137\" data-end=\"205\">The Reinforcement Learning Framework (Agent, Environment, Reward)<\/h2>\n<p data-start=\"207\" data-end=\"869\">Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with its environment to maximize cumulative rewards. Unlike supervised learning, which relies on labeled datasets, RL is driven by experience and feedback, making it particularly suitable for dynamic, uncertain, and sequential decision-making problems. At the heart of reinforcement learning lies a framework built on three core components: the <strong data-start=\"666\" data-end=\"675\">agent<\/strong>, the <strong data-start=\"681\" data-end=\"696\">environment<\/strong>, and the <strong data-start=\"706\" data-end=\"723\">reward signal<\/strong>. Understanding this framework is essential for designing RL systems, analyzing algorithms, and applying them effectively to real-world scenarios.<\/p>\n<h3 data-start=\"871\" data-end=\"916\">The Agent: The Learner and Decision-Maker<\/h3>\n<p data-start=\"918\" data-end=\"1267\">In reinforcement learning, the <strong data-start=\"949\" data-end=\"958\">agent<\/strong> is the learner or decision-maker. It is responsible for choosing actions based on observations of the environment and for improving its strategy over time to maximize rewards. The agent can be a software program, a robot, or any autonomous system capable of perceiving its environment and acting within it.<\/p>\n<p data-start=\"1269\" data-end=\"1455\">The behavior of an agent is governed by a <strong data-start=\"1311\" data-end=\"1321\">policy<\/strong>, denoted as <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span>. 
A policy defines a mapping from observed <strong data-start=\"1386\" data-end=\"1396\">states<\/strong> to actions and can be <strong data-start=\"1419\" data-end=\"1436\">deterministic<\/strong> or <strong data-start=\"1440\" data-end=\"1454\">stochastic<\/strong>:<\/p>\n<ul data-start=\"1457\" data-end=\"1798\">\n<li data-start=\"1457\" data-end=\"1570\"><strong data-start=\"1459\" data-end=\"1483\">Deterministic Policy<\/strong>: Maps a state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span> directly to a specific action <span class=\"katex\"><span class=\"katex-mathml\">aa<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span>, i.e., <span class=\"katex\"><span class=\"katex-mathml\">a=\u03c0(s)a = \\pi(s)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"1571\" data-end=\"1798\"><strong data-start=\"1573\" data-end=\"1594\">Stochastic Policy<\/strong>: Provides a probability distribution over actions in a given state, i.e., <span class=\"katex\"><span class=\"katex-mathml\">\u03c0(a\u2223s)=P(a\u2223s)\\pi(a|s) = P(a|s)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>. This allows the agent to incorporate randomness in action selection, which is crucial for exploration.<\/li>\n<\/ul>\n<p data-start=\"1800\" data-end=\"2281\">The agent\u2019s primary task is to learn an <strong data-start=\"1840\" data-end=\"1858\">optimal policy<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\u2217\\pi^*<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">\u03c0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>, which maximizes the expected cumulative reward over time. To do this, the agent relies on observations, past experiences, and feedback from the environment to update its knowledge or value estimations. 
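<\/p>
<p>As a very small example of how that accumulated knowledge is turned into behavior, the sketch below assumes the agent keeps a table of Q-value estimates and acts with an \u03b5-greedy rule: usually the best-known action, occasionally a random one. All numbers and names are invented.<\/p>
<pre><code>import random

epsilon = 0.1                                        # exploration rate (hypothetical)
q_values = {'left': 0.2, 'right': 1.3, 'up': 0.7}    # invented estimates for one state

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if random.random() &lt; epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

print(epsilon_greedy(q_values, epsilon))   # 'right' most of the time
<\/code><\/pre>
<p>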
Modern RL agents often use <strong data-start=\"2101\" data-end=\"2120\">value functions<\/strong>, <strong data-start=\"2122\" data-end=\"2134\">Q-values<\/strong>, or deep neural networks to represent their policy and expected future rewards, allowing them to handle complex and high-dimensional state spaces.<\/p>\n<h3 data-start=\"2283\" data-end=\"2322\">The Environment: The External World<\/h3>\n<p data-start=\"2324\" data-end=\"2674\">The <strong data-start=\"2328\" data-end=\"2343\">environment<\/strong> encompasses everything external to the agent. It defines the context in which the agent operates, including the states, dynamics, rules, and external conditions that determine how actions lead to consequences. In reinforcement learning, the environment is often formalized as a <strong data-start=\"2622\" data-end=\"2655\">Markov Decision Process (MDP)<\/strong>, characterized by:<\/p>\n<ul data-start=\"2676\" data-end=\"3180\">\n<li data-start=\"2676\" data-end=\"2742\"><span class=\"katex\"><span class=\"katex-mathml\">SS<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">S<\/span><\/span><\/span><\/span>: Set of all possible states the environment can be in.<\/li>\n<li data-start=\"2743\" data-end=\"2807\"><span class=\"katex\"><span class=\"katex-mathml\">AA<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">A<\/span><\/span><\/span><\/span>: Set of all possible actions available to the agent.<\/li>\n<li data-start=\"2808\" data-end=\"2952\"><span class=\"katex\"><span class=\"katex-mathml\">P(s\u2032\u2223s,a)P(s&#8217;|s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>: State transition probability, describing how the environment evolves after the agent takes action <span class=\"katex\"><span class=\"katex-mathml\">aa<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span> in state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"2953\" data-end=\"3107\"><span class=\"katex\"><span class=\"katex-mathml\">R(s,a,s\u2032)R(s,a,s&#8217;)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">R<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span 
class=\"mclose\">)<\/span><\/span><\/span><\/span>: Reward function, specifying the immediate reward received when transitioning from state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span> to state <span class=\"katex\"><span class=\"katex-mathml\">s\u2032s&#8217;<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> via action <span class=\"katex\"><span class=\"katex-mathml\">aa<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"3108\" data-end=\"3180\"><span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span>: Discount factor, balancing immediate and future rewards.<\/li>\n<\/ul>\n<p data-start=\"3182\" data-end=\"3630\">The environment responds to the agent\u2019s actions by transitioning to a new state and delivering a reward. Importantly, the environment may be <strong data-start=\"3323\" data-end=\"3340\">deterministic<\/strong>, where actions lead to predictable outcomes, or <strong data-start=\"3389\" data-end=\"3403\">stochastic<\/strong>, where outcomes are probabilistic. The complexity of the environment directly affects the agent\u2019s learning process: the more uncertain and dynamic the environment, the more sophisticated the agent must be to adapt effectively.<\/p>\n<p data-start=\"3632\" data-end=\"3965\">Environments can range from simple grid worlds, used in educational examples, to highly complex real-world systems such as autonomous driving, robotic manipulation, or financial trading platforms. The environment defines the <strong data-start=\"3857\" data-end=\"3878\">rules of the game<\/strong>, and the agent\u2019s success depends on its ability to understand and exploit these rules.<\/p>\n<h3 data-start=\"3967\" data-end=\"4011\">The Reward Signal: Feedback for Learning<\/h3>\n<p data-start=\"4013\" data-end=\"4386\">The <strong data-start=\"4017\" data-end=\"4027\">reward<\/strong> is the scalar feedback signal provided by the environment after an action is taken. It is the primary mechanism through which the agent evaluates its behavior and learns which actions are beneficial or detrimental. 
Rewards guide the agent toward achieving its goal, which is typically framed as <strong data-start=\"4323\" data-end=\"4355\">maximizing cumulative reward<\/strong>, also called the <strong data-start=\"4373\" data-end=\"4383\">return<\/strong>.<\/p>\n<p data-start=\"4388\" data-end=\"4494\">The cumulative return <span class=\"katex\"><span class=\"katex-mathml\">GtG_t<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">G<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> at time step <span class=\"katex\"><span class=\"katex-mathml\">tt<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">t<\/span><\/span><\/span><\/span> is defined as the sum of discounted future rewards:<\/p>\n<p data-start=\"6853\" data-end=\"7095\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">Gt=rt+1+\u03b3rt+2+\u03b32rt+3+\u22ef=\u2211k=0\u221e\u03b3krt+k+1G_t = r_{t+1} + \\gamma r_{t+2} + \\gamma^2 r_{t+3} + \\dots = \\sum_{k=0}^{\\infty} \\gamma^k r_{t+k+1}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">G<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>2<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>3<\/span><\/span><\/span><span 
class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"minner\">\u22ef<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">k<\/span><span class=\"mrel mtight\">=<\/span>0<\/span><\/span><span class=\"mop op-symbol large-op\">\u2211<\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">k<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span><span class=\"mord mathnormal mtight\">k<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p data-start=\"4602\" data-end=\"4831\">where <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\u2208[0,1]\\gamma \\in [0,1]<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><span class=\"mrel\">\u2208<\/span><\/span><span class=\"base\"><span class=\"mopen\">[<\/span><span class=\"mord\">0<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\">1<\/span><span class=\"mclose\">]<\/span><\/span><\/span><\/span> is the <strong data-start=\"4638\" data-end=\"4657\">discount factor<\/strong>, representing the agent\u2019s preference for immediate versus future rewards. A high <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span> encourages long-term planning, while a low <span class=\"katex\"><span class=\"katex-mathml\">\u03b3\\gamma<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03b3<\/span><\/span><\/span><\/span> favors immediate gains.<\/p>\n<p data-start=\"4833\" data-end=\"4889\">The reward signal has several essential characteristics:<\/p>\n<ol data-start=\"4891\" data-end=\"5517\">\n<li data-start=\"4891\" data-end=\"5053\"><strong data-start=\"4894\" data-end=\"4911\">Scalar Nature<\/strong>: The reward is typically a single numerical value, simplifying the feedback while conveying essential information about success or failure.<\/li>\n<li data-start=\"5054\" data-end=\"5261\"><strong data-start=\"5057\" data-end=\"5080\">Sparsity or Density<\/strong>: Rewards can be dense (frequent feedback) or sparse (rare feedback), affecting how quickly the agent can learn. 
Sparse rewards often require sophisticated exploration strategies.<\/li>\n<li data-start=\"5262\" data-end=\"5517\"><strong data-start=\"5265\" data-end=\"5276\">Shaping<\/strong>: Rewards can be designed to guide the agent toward the desired behavior more effectively, a process known as <strong data-start=\"5386\" data-end=\"5404\">reward shaping<\/strong>. Poorly designed rewards can lead to unintended behaviors, highlighting the importance of careful reward design.<\/li>\n<\/ol>\n<h3 data-start=\"5519\" data-end=\"5567\">Agent-Environment Interaction: The Core Loop<\/h3>\n<p data-start=\"5569\" data-end=\"5705\">The <strong data-start=\"5573\" data-end=\"5593\">interaction loop<\/strong> between the agent and environment is fundamental to reinforcement learning. At each discrete time step <span class=\"katex\"><span class=\"katex-mathml\">tt<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">t<\/span><\/span><\/span><\/span>:<\/p>\n<ol data-start=\"5707\" data-end=\"6066\">\n<li data-start=\"5707\" data-end=\"5778\">The agent observes the current state <span class=\"katex\"><span class=\"katex-mathml\">sts_t<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> of the environment.<\/li>\n<li data-start=\"5779\" data-end=\"5856\">The agent selects an action <span class=\"katex\"><span class=\"katex-mathml\">ata_t<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> according to its policy <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"5857\" data-end=\"5954\">The environment transitions to a new state <span class=\"katex\"><span class=\"katex-mathml\">st+1s_{t+1}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> and returns a reward <span class=\"katex\"><span class=\"katex-mathml\">rt+1r_{t+1}<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span 
class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"5955\" data-end=\"6066\">The agent updates its policy or value function based on the experience <span class=\"katex\"><span class=\"katex-mathml\">(st,at,rt+1,st+1)(s_t, a_t, r_{t+1}, s_{t+1})<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">t<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">r<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">t<\/span><span class=\"mbin mtight\">+<\/span>1<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>.<\/li>\n<\/ol>\n<p data-start=\"6068\" data-end=\"6326\">This loop continues over multiple episodes, where an <strong data-start=\"6121\" data-end=\"6132\">episode<\/strong> is a sequence of interactions that ends in a terminal state. Through repeated interaction, the agent learns which actions lead to higher cumulative rewards, gradually improving its performance.<\/p>\n<h3 data-start=\"6328\" data-end=\"6373\">Formal Framework: Markov Decision Process<\/h3>\n<p data-start=\"6375\" data-end=\"6765\">Most RL problems are formalized using the <strong data-start=\"6417\" data-end=\"6450\">Markov Decision Process (MDP)<\/strong> framework. The MDP assumes the <strong data-start=\"6482\" data-end=\"6501\">Markov property<\/strong>, where the future state depends only on the current state and action, not the history of past states. 
This assumption simplifies modeling and computation, enabling the derivation of foundational algorithms like <strong data-start=\"6713\" data-end=\"6727\">Q-learning<\/strong>, <strong data-start=\"6729\" data-end=\"6738\">SARSA<\/strong>, and <strong data-start=\"6744\" data-end=\"6764\">policy gradients<\/strong>.<\/p>\n<p data-start=\"6767\" data-end=\"6837\">The components of the RL framework map directly onto the MDP elements:<\/p>\n<ul data-start=\"6839\" data-end=\"7124\">\n<li data-start=\"6839\" data-end=\"6918\"><strong data-start=\"6841\" data-end=\"6850\">Agent<\/strong>: Chooses actions <span class=\"katex\"><span class=\"katex-mathml\">a\u2208Aa \\in A<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><span class=\"mrel\">\u2208<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">A<\/span><\/span><\/span><\/span> according to its policy <span class=\"katex\"><span class=\"katex-mathml\">\u03c0\\pi<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">\u03c0<\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"6919\" data-end=\"7011\"><strong data-start=\"6921\" data-end=\"6936\">Environment<\/strong>: Provides state transitions <span class=\"katex\"><span class=\"katex-mathml\">P(s\u2032\u2223s,a)P(s&#8217;|s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> and rewards <span class=\"katex\"><span class=\"katex-mathml\">R(s,a,s\u2032)R(s,a,s&#8217;)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">R<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"7012\" data-end=\"7124\"><strong data-start=\"7014\" data-end=\"7031\">Reward Signal<\/strong>: Guides the agent toward optimal behavior, shaping the policy through learning algorithms.<\/li>\n<\/ul>\n<p data-start=\"7126\" data-end=\"7251\">Value functions and Bellman equations are used to compute optimal policies and expected returns within this formal framework.<\/p>\n<h3 data-start=\"7253\" data-end=\"7276\">Real-World Examples<\/h3>\n<p data-start=\"7278\" data-end=\"7328\">The RL framework is applicable in various domains:<\/p>\n<ul data-start=\"7330\" data-end=\"7832\">\n<li data-start=\"7330\" data-end=\"7493\"><strong data-start=\"7332\" data-end=\"7344\">Robotics<\/strong>: A robot (agent) navigates a warehouse (environment), receiving 
positive rewards for reaching delivery points and negative rewards for collisions.<\/li>\n<li data-start=\"7494\" data-end=\"7676\"><strong data-start=\"7496\" data-end=\"7506\">Gaming<\/strong>: An AI player (agent) interacts with a video game (environment), earning points (reward) for completing objectives while learning strategies to maximize overall score.<\/li>\n<li data-start=\"7677\" data-end=\"7832\"><strong data-start=\"7679\" data-end=\"7690\">Finance<\/strong>: An automated trading system (agent) operates in the stock market (environment), receiving profit or loss (reward) signals based on trades.<\/li>\n<\/ul>\n<p data-start=\"7834\" data-end=\"7951\">In each case, the agent-environment-reward framework provides a structured approach for learning and decision-making.<\/p>\n<h2 data-start=\"108\" data-end=\"142\">Types of Reinforcement Learning<\/h2>\n<p data-start=\"144\" data-end=\"840\">Reinforcement Learning (RL) is a branch of machine learning in which an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. Unlike supervised learning, RL does not rely on labeled datasets; instead, it depends on feedback obtained through trial-and-error interactions. Over the years, researchers have developed different approaches to reinforcement learning, depending on the availability of information about the environment, the learning objective, and the methods used to optimize policies. Understanding the types of reinforcement learning is critical for selecting suitable algorithms and designing effective solutions for diverse applications.<\/p>\n<h3 data-start=\"842\" data-end=\"898\">1. Model-Based vs. Model-Free Reinforcement Learning<\/h3>\n<p data-start=\"900\" data-end=\"1019\">One of the primary distinctions in reinforcement learning is based on whether the agent has a model of the environment.<\/p>\n<h4 data-start=\"1021\" data-end=\"1060\">Model-Based Reinforcement Learning<\/h4>\n<p data-start=\"1062\" data-end=\"1190\">In <strong data-start=\"1065\" data-end=\"1083\">model-based RL<\/strong>, the agent has access to\u2014or learns\u2014a model of the environment\u2019s dynamics. 
The model typically consists of:<\/p>\n<ul data-start=\"1192\" data-end=\"1444\">\n<li data-start=\"1192\" data-end=\"1346\"><strong data-start=\"1194\" data-end=\"1223\">State transition function<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">P(s\u2032\u2223s,a)P(s&#8217;|s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">P<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\">\u2223<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>: Predicts the probability of moving to a new state <span class=\"katex\"><span class=\"katex-mathml\">s\u2032s&#8217;<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">s<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u2032<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> when taking action <span class=\"katex\"><span class=\"katex-mathml\">aa<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span> in state <span class=\"katex\"><span class=\"katex-mathml\">ss<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">s<\/span><\/span><\/span><\/span>.<\/li>\n<li data-start=\"1347\" data-end=\"1444\"><strong data-start=\"1349\" data-end=\"1368\">Reward function<\/strong> <span class=\"katex\"><span class=\"katex-mathml\">R(s,a)R(s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">R<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>: Predicts the expected reward for a given state-action pair.<\/li>\n<\/ul>\n<p data-start=\"1446\" data-end=\"1746\">With this model, the agent can <strong data-start=\"1477\" data-end=\"1498\">simulate outcomes<\/strong> and plan its actions before interacting with the real environment. Model-based RL is advantageous because it is <strong data-start=\"1611\" data-end=\"1631\">sample-efficient<\/strong>, allowing agents to evaluate multiple hypothetical strategies without requiring extensive real-world experience.<\/p>\n<p data-start=\"1748\" data-end=\"2196\">Classical planning algorithms, such as <strong data-start=\"1787\" data-end=\"1806\">value iteration<\/strong> and <strong data-start=\"1811\" data-end=\"1831\">policy iteration<\/strong>, are often applied in model-based RL. Modern applications include robotics, autonomous driving, and industrial control systems, where planning using a learned or known model reduces the cost of trial-and-error in real environments. 
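<p>As a small illustration of planning with a known model, the sketch below stores the transition and reward functions as plain Python tables and runs a few sweeps of value iteration over them; the two-state MDP and its numbers are invented purely for demonstration.<\/p>
<pre><code class=\"language-python\">
# Planning over a known model (sketch): a made-up two-state MDP.
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the expected reward.
P = {
    's0': {'stay': [(1.0, 's0')], 'go': [(0.8, 's1'), (0.2, 's0')]},
    's1': {'stay': [(1.0, 's1')], 'go': [(1.0, 's0')]},
}
R = {
    's0': {'stay': 0.0, 'go': 1.0},
    's1': {'stay': 2.0, 'go': 0.0},
}
gamma = 0.9

def lookahead(s, a, V):
    # Expected return of taking action a in state s and then following V.
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

V = {s: 0.0 for s in P}
for _ in range(100):                          # value-iteration sweeps over the model
    V = {s: max(lookahead(s, a, V) for a in P[s]) for s in P}

# Greedy plan extracted from the computed values, without touching a real environment.
plan = {s: max(P[s], key=lambda a, s=s: lookahead(s, a, V)) for s in P}
print(V, plan)
<\/code><\/pre>
<p>In model-based RL, the tables standing in for P and R would either be given or estimated from experience before planning.<\/p>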
The main limitation of model-based RL is that constructing accurate models in complex or stochastic environments can be challenging.<\/p>\n<h4 data-start=\"2198\" data-end=\"2236\">Model-Free Reinforcement Learning<\/h4>\n<p data-start=\"2238\" data-end=\"2531\">In <strong data-start=\"2241\" data-end=\"2258\">model-free RL<\/strong>, the agent learns to make decisions <strong data-start=\"2295\" data-end=\"2324\">directly from interaction<\/strong> with the environment without requiring knowledge of transition probabilities or reward functions. Model-free methods rely on estimating <strong data-start=\"2461\" data-end=\"2480\">value functions<\/strong>, <strong data-start=\"2482\" data-end=\"2494\">Q-values<\/strong>, or <strong data-start=\"2499\" data-end=\"2511\">policies<\/strong> using experience.<\/p>\n<p data-start=\"2533\" data-end=\"2570\">Common model-free algorithms include:<\/p>\n<ul data-start=\"2572\" data-end=\"3074\">\n<li data-start=\"2572\" data-end=\"2793\"><strong data-start=\"2574\" data-end=\"2588\">Q-Learning<\/strong>: Learns the optimal action-value function <span class=\"katex\"><span class=\"katex-mathml\">Q(s,a)Q(s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> to derive an optimal policy. It is <strong data-start=\"2679\" data-end=\"2693\">off-policy<\/strong>, meaning the learning policy can differ from the behavior policy used to explore the environment.<\/li>\n<li data-start=\"2794\" data-end=\"2935\"><strong data-start=\"2796\" data-end=\"2840\">SARSA (State-Action-Reward-State-Action)<\/strong>: An <strong data-start=\"2845\" data-end=\"2858\">on-policy<\/strong> method that updates Q-values based on actions actually taken by the agent.<\/li>\n<li data-start=\"2936\" data-end=\"3074\"><strong data-start=\"2938\" data-end=\"2965\">Policy Gradient Methods<\/strong>: Directly optimize the policy using gradients of expected rewards, often used in continuous action spaces.<\/li>\n<\/ul>\n<p data-start=\"3076\" data-end=\"3382\">Model-free RL is widely used in applications where constructing a model is infeasible or costly, such as video games, real-time strategy systems, and autonomous navigation. The trade-off is that model-free methods generally require more interactions with the environment, making them less sample-efficient.<\/p>\n<h3 data-start=\"3384\" data-end=\"3442\">2. Value-Based vs. Policy-Based Reinforcement Learning<\/h3>\n<p data-start=\"3444\" data-end=\"3579\">Another way to categorize RL methods is based on whether they focus on <strong data-start=\"3515\" data-end=\"3544\">value function estimation<\/strong> or <strong data-start=\"3548\" data-end=\"3578\">direct policy optimization<\/strong>.<\/p>\n<h4 data-start=\"3581\" data-end=\"3620\">Value-Based Reinforcement Learning<\/h4>\n<p data-start=\"3622\" data-end=\"3863\"><strong data-start=\"3622\" data-end=\"3640\">Value-based RL<\/strong> algorithms learn a function that estimates the expected return from states or state-action pairs. 
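<p>For example, with a learned action-value table (the values below are invented), acting well reduces to an argmax, optionally softened with a small amount of random exploration:<\/p>
<pre><code class=\"language-python\">
import random

# Hypothetical learned action values Q[state][action] for a tiny task.
Q = {'s0': {'left': 0.1, 'right': 0.7}, 's1': {'left': 0.4, 'right': 0.2}}

def greedy_action(state):
    # Exploit: pick the action with the highest estimated return.
    return max(Q[state], key=Q[state].get)

def epsilon_greedy_action(state, epsilon=0.1):
    # Occasionally explore instead of always exploiting the estimates.
    if epsilon > random.random():
        return random.choice(list(Q[state]))
    return greedy_action(state)
<\/code><\/pre>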
Once the value function is learned, the agent derives an optimal policy by selecting actions that maximize expected value.<\/p>\n<p data-start=\"3865\" data-end=\"3897\">Key value-based methods include:<\/p>\n<ul data-start=\"3899\" data-end=\"4153\">\n<li data-start=\"3899\" data-end=\"4017\"><strong data-start=\"3901\" data-end=\"3915\">Q-Learning<\/strong>: Uses the action-value function <span class=\"katex\"><span class=\"katex-mathml\">Q(s,a)Q(s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> to select the action with the highest expected return.<\/li>\n<li data-start=\"4018\" data-end=\"4153\"><strong data-start=\"4020\" data-end=\"4045\">Deep Q-Networks (DQN)<\/strong>: Extend Q-learning to high-dimensional state spaces using deep neural networks as function approximators.<\/li>\n<\/ul>\n<p data-start=\"4155\" data-end=\"4341\">Value-based methods work well in <strong data-start=\"4188\" data-end=\"4214\">discrete action spaces<\/strong> but can struggle in <strong data-start=\"4235\" data-end=\"4263\">continuous action spaces<\/strong>, where selecting the best action by maximizing a Q-function is non-trivial.<\/p>\n<h4 data-start=\"4343\" data-end=\"4383\">Policy-Based Reinforcement Learning<\/h4>\n<p data-start=\"4385\" data-end=\"4610\"><strong data-start=\"4385\" data-end=\"4404\">Policy-based RL<\/strong> methods learn the policy directly, without relying on value functions. The agent adjusts its policy parameters to maximize expected cumulative reward using <strong data-start=\"4561\" data-end=\"4580\">gradient ascent<\/strong> on a performance objective.<\/p>\n<p data-start=\"4612\" data-end=\"4629\">Examples include:<\/p>\n<ul data-start=\"4631\" data-end=\"4874\">\n<li data-start=\"4631\" data-end=\"4754\"><strong data-start=\"4633\" data-end=\"4646\">REINFORCE<\/strong>: Uses Monte Carlo estimates to compute the gradient of expected reward with respect to policy parameters.<\/li>\n<li data-start=\"4755\" data-end=\"4874\"><strong data-start=\"4757\" data-end=\"4795\">Actor-Only Policy Gradient Methods<\/strong>: Optimize the policy directly in both discrete and continuous action spaces.<\/li>\n<\/ul>\n<p data-start=\"4876\" data-end=\"5142\">Policy-based methods handle <strong data-start=\"4904\" data-end=\"4947\">continuous and high-dimensional actions<\/strong> better than value-based methods and can naturally represent stochastic policies, which is useful for exploration. However, they may suffer from higher variance in updates and slower convergence.<\/p>\n<h4 data-start=\"5144\" data-end=\"5169\">Actor-Critic Methods<\/h4>\n<p data-start=\"5171\" data-end=\"5276\"><strong data-start=\"5171\" data-end=\"5190\">Actor-critic RL<\/strong> combines the strengths of value-based and policy-based methods. 
In this architecture:<\/p>\n<ul data-start=\"5278\" data-end=\"5475\">\n<li data-start=\"5278\" data-end=\"5336\">The <strong data-start=\"5284\" data-end=\"5293\">actor<\/strong> maintains the policy, selecting actions.<\/li>\n<li data-start=\"5337\" data-end=\"5475\">The <strong data-start=\"5343\" data-end=\"5353\">critic<\/strong> evaluates the policy using a value function (state-value or action-value) and provides feedback for policy improvement.<\/li>\n<\/ul>\n<p data-start=\"5477\" data-end=\"5765\">Actor-critic methods reduce variance compared to pure policy gradients and are well-suited for complex tasks, such as robotic control, multi-agent systems, and large-scale game environments. Examples include <strong data-start=\"5685\" data-end=\"5721\">Advantage Actor-Critic (A2C\/A3C)<\/strong> and <strong data-start=\"5726\" data-end=\"5764\">Proximal Policy Optimization (PPO)<\/strong>.<\/p>\n<h3 data-start=\"5767\" data-end=\"5821\">3. On-Policy vs. Off-Policy Reinforcement Learning<\/h3>\n<p data-start=\"5823\" data-end=\"5953\">RL algorithms can also be classified based on whether the learning policy is the same as the behavior policy used for exploration.<\/p>\n<h4 data-start=\"5955\" data-end=\"5992\">On-Policy Reinforcement Learning<\/h4>\n<p data-start=\"5994\" data-end=\"6181\">In <strong data-start=\"5997\" data-end=\"6013\">on-policy RL<\/strong>, the agent learns about and improves the <strong data-start=\"6055\" data-end=\"6070\">same policy<\/strong> that it uses to interact with the environment. The agent updates its policy based on actions actually taken.<\/p>\n<p data-start=\"6183\" data-end=\"6297\">Example: <strong data-start=\"6192\" data-end=\"6201\">SARSA<\/strong> is on-policy because Q-values are updated using the next action chosen by the current policy.<\/p>\n<p data-start=\"6299\" data-end=\"6479\">On-policy methods ensure stable learning but can be slower because exploration is tied to the current policy, limiting the ability to leverage data collected from other strategies.<\/p>\n<h4 data-start=\"6481\" data-end=\"6519\">Off-Policy Reinforcement Learning<\/h4>\n<p data-start=\"6521\" data-end=\"6754\">In <strong data-start=\"6524\" data-end=\"6541\">off-policy RL<\/strong>, the agent learns an <strong data-start=\"6563\" data-end=\"6595\">optimal policy independently<\/strong> of the behavior policy used to generate data. This allows the agent to learn from historical data or exploratory actions that differ from the target policy.<\/p>\n<p data-start=\"6756\" data-end=\"6911\">Example: <strong data-start=\"6765\" data-end=\"6779\">Q-Learning<\/strong> is off-policy because updates are based on the best possible action in the next state, not necessarily the action actually taken.<\/p>\n<p data-start=\"6913\" data-end=\"7076\">Off-policy methods are more flexible and sample-efficient but can be less stable, requiring techniques like experience replay or target networks for stabilization.<\/p>\n<h3 data-start=\"7078\" data-end=\"7131\">4. Discrete vs. 
Continuous Reinforcement Learning<\/h3>\n<p data-start=\"7133\" data-end=\"7241\">Reinforcement learning tasks can also be categorized based on the <strong data-start=\"7199\" data-end=\"7240\">nature of the action and state spaces<\/strong>:<\/p>\n<ul data-start=\"7243\" data-end=\"7632\">\n<li data-start=\"7243\" data-end=\"7437\"><strong data-start=\"7245\" data-end=\"7260\">Discrete RL<\/strong>: Actions and states are limited and enumerable, suitable for tabular methods or classical algorithms like Q-learning and SARSA. Example: grid-world navigation or board games.<\/li>\n<li data-start=\"7438\" data-end=\"7632\"><strong data-start=\"7440\" data-end=\"7457\">Continuous RL<\/strong>: Actions or states are continuous, requiring function approximation techniques such as neural networks or policy gradients. Example: robotic arm control, autonomous vehicles.<\/li>\n<\/ul>\n<h2 data-start=\"125\" data-end=\"174\">Important Algorithms in Reinforcement Learning<\/h2>\n<p data-start=\"176\" data-end=\"767\">Reinforcement Learning (RL) has emerged as one of the most influential paradigms in artificial intelligence, enabling agents to learn optimal behavior through interaction with an environment. RL algorithms provide the mathematical and computational tools to guide agents toward maximizing cumulative rewards, often in complex and dynamic environments. Over the decades, a variety of RL algorithms have been developed, each with distinct approaches, assumptions, and application domains. Understanding these key algorithms is essential for both theoretical study and practical implementation.<\/p>\n<h3 data-start=\"769\" data-end=\"808\">1. Dynamic Programming (DP) Methods<\/h3>\n<p data-start=\"810\" data-end=\"1097\">Dynamic Programming forms the foundation of many RL algorithms. 
Introduced in the 1950s by <strong>Richard Bellman<\/strong>, DP provides techniques for solving sequential decision-making problems under the assumption of a <strong>perfectly known environment<\/strong>, described by a Markov Decision Process (MDP).<\/p>
<h4>1.1 Policy Evaluation<\/h4>
<p>Policy evaluation computes the <strong>state-value function<\/strong> \\(V^\\pi(s)\\) for a given policy \\(\\pi\\) using the <strong>Bellman expectation equation<\/strong>:<\/p>
<p>$$V^\\pi(s) = \\sum_{a} \\pi(a|s) \\sum_{s'} P(s'|s,a) \\big[ R(s,a,s') + \\gamma V^\\pi(s') \\big]$$<\/p>
<p>The process iteratively updates value estimates until convergence. 
Policy evaluation is essential for understanding the quality of a given policy.<\/p>
<h4>1.2 Policy Iteration<\/h4>
<p>Policy iteration alternates between <strong>policy evaluation<\/strong> and <strong>policy improvement<\/strong>:<\/p>
<ol>
<li>Evaluate the current policy \\(\\pi\\) using the Bellman expectation equation.<\/li>
<li>Improve the policy by acting greedily with respect to the current value function:<\/li>
<\/ol>
<p>$$\\pi'(s) = \\arg\\max_a \\sum_{s'} P(s'|s,a) \\big[ R(s,a,s') + \\gamma V^\\pi(s') \\big]$$<\/p>
<p>This cycle repeats until the policy converges to the <strong>optimal policy<\/strong> \\(\\pi^*\\).<\/p>
<h4>1.3 Value Iteration<\/h4>
<p>Value iteration combines policy evaluation and improvement in a single step using the <strong>Bellman optimality equation<\/strong>:<\/p>
<p>$$V_{k+1}(s) = \\max_a \\sum_{s'} P(s'|s,a) \\big[ R(s,a,s') + \\gamma V_k(s') \\big]$$<\/p>
<p>Value iteration efficiently computes the optimal value function \\(V^*(s)\\) and corresponding policy. DP methods are powerful but require full knowledge of the environment and are computationally intensive for large state spaces.<\/p>
<h3>2. Monte Carlo Methods<\/h3>
<p>Monte Carlo (MC) methods are <strong>model-free<\/strong> RL algorithms that learn from <strong>complete episodes<\/strong> of experience without requiring a model of the environment. 
They estimate value functions based on <strong data-start=\"2668\" data-end=\"2687\">average returns<\/strong> observed during repeated episodes.<\/p>\n<h4 data-start=\"2724\" data-end=\"2763\">2.1 First-Visit and Every-Visit MC<\/h4>\n<ul data-start=\"2765\" data-end=\"3007\">\n<li data-start=\"2765\" data-end=\"2893\"><strong data-start=\"2767\" data-end=\"2785\">First-Visit MC<\/strong>: Updates the value of a state based on the return from the first time the state is visited in an episode.<\/li>\n<li data-start=\"2894\" data-end=\"3007\"><strong data-start=\"2896\" data-end=\"2914\">Every-Visit MC<\/strong>: Updates the value of a state using returns from all visits to that state during an episode.<\/li>\n<\/ul>\n<p data-start=\"3009\" data-end=\"3247\">MC methods are simple and intuitive but are suitable only for episodic tasks, where episodes terminate in a finite number of steps. They provide unbiased estimates of value functions and serve as a foundation for more advanced RL methods.<\/p>\n<h4 data-start=\"3249\" data-end=\"3277\">2.2 Monte Carlo Control<\/h4>\n<p data-start=\"3279\" data-end=\"3565\">Monte Carlo control extends MC methods to <strong data-start=\"3321\" data-end=\"3343\">policy improvement<\/strong>, typically using an <strong data-start=\"3364\" data-end=\"3383\">\u03b5-greedy policy<\/strong> to balance exploration and exploitation. The agent updates its action-value function <span class=\"katex\"><span class=\"katex-mathml\">Q(s,a)Q(s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span> based on observed returns and derives a greedy policy with respect to <span class=\"katex\"><span class=\"katex-mathml\">Q(s,a)Q(s,a)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">Q<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">s<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span>.<\/p>\n<h3 data-start=\"3572\" data-end=\"3612\">3. Temporal-Difference (TD) Learning<\/h3>\n<p data-start=\"3614\" data-end=\"3909\">Temporal-Difference learning combines ideas from dynamic programming and Monte Carlo methods. Unlike MC methods, TD learning updates estimates <strong data-start=\"3757\" data-end=\"3767\">online<\/strong>, after every step, without waiting for the end of an episode. 
This makes TD methods suitable for <strong>continuing tasks<\/strong> and large state spaces.<\/p>
<h4>3.1 TD(0)<\/h4>
<p>TD(0) updates the state-value function \\(V(s)\\) using the observed reward and the value of the next state:<\/p>
<p>$$V(s_t) \\leftarrow V(s_t) + \\alpha \\big[ r_{t+1} + \\gamma V(s_{t+1}) - V(s_t) \\big]$$<\/p>
<p>Here, \\(\\alpha\\) is the learning rate. TD(0) is simple, incremental, and capable of learning in <strong>non-terminating environments<\/strong>.<\/p>
<h4>3.2 TD(\u03bb) and Eligibility Traces<\/h4>
<p>TD(\u03bb) generalizes TD(0) and MC methods by combining short-term and long-term updates using <strong>eligibility traces<\/strong>. A trace assigns credit to previously visited states, allowing rewards to propagate backward efficiently. The \u03bb parameter controls the weighting between short-term and long-term returns:<\/p>
<p>$$V(s) \\leftarrow V(s) + \\alpha \\sum_{t} (\\gamma \\lambda)^{t-t_s} \\delta_t$$<\/p>
<p>where \\(\\delta_t = r_{t+1} + \\gamma V(s_{t+1}) - V(s_t)\\) is the TD error. TD(\u03bb) improves convergence speed and handles delayed reward problems effectively.<\/p>
<h3>4. Q-Learning<\/h3>
<p>Q-Learning, introduced by <strong>Chris Watkins<\/strong> in 1989, is a <strong>model-free, off-policy, value-based<\/strong> algorithm. 
It learns the <strong>action-value function<\/strong> \\(Q(s,a)\\) directly, allowing the agent to derive the optimal policy:<\/p>
<p>$$Q(s_t,a_t) \\leftarrow Q(s_t,a_t) + \\alpha \\big[ r_{t+1} + \\gamma \\max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \\big]$$<\/p>
<p>Key features of Q-Learning:<\/p>
<ul>
<li><strong>Off-policy<\/strong>: Learns the optimal policy independently of the behavior policy used for exploration.<\/li>
<li><strong>Model-free<\/strong>: Does not require knowledge of the environment\u2019s dynamics.<\/li>
<li><strong>Flexible<\/strong>: Can be applied to discrete action spaces, and with function approximation (e.g., neural networks), can scale to complex tasks.<\/li>
<\/ul>
<p>Q-Learning forms the basis of <strong>Deep Q-Networks (DQN)<\/strong>, which integrate deep learning to handle high-dimensional inputs like images.<\/p>
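a<">
<p>A minimal tabular sketch of the Q-Learning update above, assuming a Gymnasium-style environment with a small discrete action space; the environment object, <code>n_actions<\/code>, and the hyperparameter values are illustrative:<\/p>
<pre><code class=\"language-python\">
import random
from collections import defaultdict

# Tabular Q-Learning (sketch). 'env' is a hypothetical Gymnasium-style environment
# with hashable states and actions numbered from 0 to n_actions - 1.
def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy for exploration
            if epsilon > random.random():
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Off-policy target: bootstrap from the best next action, not the action taken next.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
<\/code><\/pre>
<p>SARSA, covered next, changes only the bootstrap term: it uses the value of the action the policy actually takes in the next state instead of the maximum over actions.<\/p>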
<h3>5. SARSA (State-Action-Reward-State-Action)<\/h3>
<p>SARSA is an <strong>on-policy<\/strong> version of Q-Learning, updating Q-values using the <strong>action actually taken<\/strong> by the current policy:<\/p>
<p>\\[ Q(s_t,a_t) \\leftarrow Q(s_t,a_t) + \\alpha \\big[ r_{t+1} + \\gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \\big] \\]<\/p>
<p>Unlike Q-Learning, SARSA evaluates the policy being followed, leading to more conservative behavior in risky environments. It is particularly useful in scenarios where safe exploration is critical.<\/p>
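<p>For contrast with the Q-Learning sketch above, a single on-policy SARSA episode might look like the following sketch; the <code>env<\/code> interface and the <code>epsilon_greedy<\/code> helper are assumed for illustration.<\/p>
<pre><code>def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    # One on-policy SARSA episode; epsilon_greedy is an assumed helper that
    # mirrors the inline action selection in the Q-Learning sketch above.
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, epsilon)
        # On-policy target: bootstrap with the action actually taken next.
        target = r + gamma * Q[s_next, a_next] * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
    return Q
<\/code><\/pre>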
<h3>6. Policy Gradient Methods<\/h3>
<p>Policy gradient methods <strong>directly optimize the policy<\/strong> without relying on value functions. The agent parametrizes the policy \\(\\pi_\\theta(a|s)\\) and updates parameters \\(\\theta\\) to maximize expected cumulative reward \\(J(\\theta)\\):<\/p>
<p>\\[ \\theta \\leftarrow \\theta + \\alpha \\nabla_\\theta J(\\theta) \\]<\/p>
<h4>6.1 REINFORCE Algorithm<\/h4>
<p>REINFORCE is a Monte Carlo policy gradient method. The update is based on the observed return \\(G_t\\):<\/p>
<p>\\[ \\theta \\leftarrow \\theta + \\alpha G_t \\nabla_\\theta \\log \\pi_\\theta(a_t|s_t) \\]<\/p>
<p>REINFORCE allows for <strong>stochastic policies<\/strong>, suitable for continuous action spaces, but suffers from high variance in updates, which can slow convergence.<\/p>
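<p>A minimal REINFORCE update, sketched here with PyTorch under the assumption that one rollout has been collected as (log-probability, reward) pairs from a hypothetical <code>policy_net<\/code>; the optimizer and discount value are likewise illustrative.<\/p>
<pre><code>import torch

def reinforce_update(optimizer, episode, gamma=0.99):
    # episode: list of (log_prob, reward) pairs from one rollout, where each
    # log_prob is e.g. Categorical(logits=policy_net(s)).log_prob(a).
    G, returns = 0.0, []
    for _, r in reversed(episode):
        G = r + gamma * G            # discounted return G_t, built backwards
        returns.insert(0, G)
    returns = torch.tensor(returns)
    log_probs = torch.stack([lp for lp, _ in episode])
    # Gradient ascent on J(theta): minimise the negative weighted log-likelihood.
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
<\/code><\/pre>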
<h3>7. Actor-Critic Methods<\/h3>
<p>Actor-critic methods combine <strong>policy-based<\/strong> and <strong>value-based<\/strong> approaches:<\/p>
<ul>
<li><strong>Actor<\/strong>: Maintains the policy and selects actions.<\/li>
<li><strong>Critic<\/strong>: Evaluates the policy using a value function and provides feedback for improvement.<\/li>
<\/ul>
<p>The <strong>TD error<\/strong> \\(\\delta_t\\) is used to update both actor and critic:<\/p>
<p>\\[ \\delta_t = r_{t+1} + \\gamma V(s_{t+1}) - V(s_t) \\]<\/p>
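<p>The following sketch shows how the TD error can drive both updates in a one-step actor-critic; the <code>critic<\/code> network, the stored <code>log_prob<\/code> from the actor, and the two optimizers are assumed objects used only for illustration.<\/p>
<pre><code>import torch

def actor_critic_step(opt_actor, opt_critic, critic, log_prob,
                      r, s, s_next, done, gamma=0.99):
    # One-step TD update shared by actor and critic.
    v = critic(s)
    v_next = critic(s_next).detach() * (1.0 - float(done))
    td_target = r + gamma * v_next
    delta = (td_target - v).detach()              # TD error, used as a constant
    critic_loss = (td_target - v).pow(2).sum()    # critic regresses onto the target
    actor_loss = (-log_prob * delta).sum()        # actor: policy gradient weighted by delta
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
<\/code><\/pre>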
<p>Actor-critic methods reduce variance compared to pure policy gradients and are widely used in modern RL, including algorithms like <strong>Advantage Actor-Critic (A2C\/A3C)<\/strong> and <strong>Proximal Policy Optimization (PPO)<\/strong>.<\/p>
<h3>8. Deep Reinforcement Learning (Deep RL)<\/h3>
<p>Deep RL combines RL algorithms with <strong>deep neural networks<\/strong> for function approximation, enabling learning in <strong>high-dimensional state spaces<\/strong> such as images or sensor data.<\/p>
<h4>8.1 Deep Q-Networks (DQN)<\/h4>
<p>DQN extends Q-Learning using a deep neural network to approximate the Q-function. Key innovations include:<\/p>
<ul>
<li><strong>Experience Replay<\/strong>: Stores past experiences to break correlations between consecutive samples.<\/li>
<li><strong>Target Network<\/strong>: Stabilizes learning by using a separate network to compute target Q-values.<\/li>
<\/ul>
<p>DQN achieved human-level performance in Atari games, demonstrating the power of combining deep learning with RL.<\/p>
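<p>As a rough illustration of the two ingredients above, here is a small sketch of an experience-replay buffer and a target-network synchronization step; <code>q_net<\/code> and <code>target_net<\/code> are assumed to be PyTorch modules, and the capacity and batch size are arbitrary example values.<\/p>
<pre><code>import random
import collections

class ReplayBuffer:
    # Fixed-size store of past transitions; sampling random minibatches
    # breaks the correlation between consecutive experiences.
    def __init__(self, capacity=100_000):
        self.buffer = collections.deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

def sync_target(q_net, target_net):
    # Copy the online network into the frozen target network that is used
    # to compute the bootstrapped target Q-values (typically every N steps).
    target_net.load_state_dict(q_net.state_dict())
<\/code><\/pre>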
<h4>8.2 Deep Deterministic Policy Gradient (DDPG)<\/h4>
<p>DDPG is an actor-critic algorithm designed for <strong>continuous action spaces<\/strong>. The actor network outputs actions directly, while the critic evaluates them. DDPG incorporates experience replay and target networks for stability and convergence.<\/p>
<h4>8.3 Proximal Policy Optimization (PPO)<\/h4>
<p>PPO improves policy gradient stability using <strong>clipped objective functions<\/strong>, balancing policy improvement and constraint enforcement. PPO is widely used in robotics, simulation, and large-scale reinforcement learning applications.<\/p>
<h3>9. Multi-Agent Reinforcement Learning (MARL)<\/h3>
<p>MARL extends RL to environments with <strong>multiple interacting agents<\/strong>, where each agent learns a policy considering the actions of others. Algorithms include:<\/p>
<ul>
<li><strong>Independent Q-Learning<\/strong>: Each agent learns independently using Q-Learning.<\/li>
<li><strong>Cooperative MARL<\/strong>: Agents share information to maximize collective reward.<\/li>
<li><strong>Competitive MARL<\/strong>: Agents learn in adversarial environments, such as games or negotiations.<\/li>
<\/ul>
<p>MARL is critical for real-world scenarios like autonomous vehicle fleets, multi-robot coordination, and strategic games.<\/p>
<h3>10. Hierarchical Reinforcement Learning (HRL)<\/h3>
<p>HRL decomposes complex tasks into <strong>subtasks<\/strong> or <strong>options<\/strong>, allowing agents to learn policies at multiple levels of abstraction. Techniques like the <strong>Options Framework<\/strong> and <strong>Feudal RL<\/strong> improve learning efficiency and scalability in large or structured environments.<\/p>
<h2>Exploration vs. Exploitation Trade-off in Reinforcement Learning<\/h2>
<p>One of the central challenges in reinforcement learning (RL) is the <strong>exploration vs. exploitation trade-off<\/strong>, a fundamental dilemma faced by an agent when deciding how to act in an environment. In simple terms, the agent must choose between <strong>exploiting<\/strong> known actions that yield high rewards and <strong>exploring<\/strong> new or less-certain actions that might lead to even higher rewards in the future. Striking the right balance between these two strategies is crucial for efficient learning and achieving optimal long-term performance.<\/p>
<h3>Understanding Exploitation<\/h3>
<p><strong>Exploitation<\/strong> refers to the agent\u2019s use of its current knowledge to maximize immediate reward. When exploiting, the agent selects the action that it believes has the highest expected value based on past experiences.<\/p>
<p>For example, consider a robot navigating a maze where it has previously discovered a path to the goal. Exploiting its knowledge, the robot will continue following the known path because it guarantees a high reward. Exploitation is essential for <strong>short-term gain<\/strong> and ensures the agent benefits from the information it has already gathered. Without exploitation, an agent may wander randomly, failing to accumulate rewards effectively.<\/p>
<p>However, excessive exploitation can be <strong>suboptimal in the long term<\/strong>. The agent may miss better strategies or more rewarding actions simply because it is too focused on what it already knows.
In RL, this limitation is often referred to as being trapped in a <strong>local optimum<\/strong>, where the agent\u2019s current knowledge is good but not globally optimal.<\/p>
<h3>Understanding Exploration<\/h3>
<p><strong>Exploration<\/strong> involves choosing actions that may not appear immediately rewarding but could provide valuable information about the environment. By exploring, the agent gathers data about unknown states and actions, which may improve its understanding of the environment and lead to better long-term decisions.<\/p>
<p>Returning to the maze example, exploration occurs when the robot tries a different route, even if it seems longer or uncertain. This strategy allows the agent to discover shortcuts or higher-reward paths that were previously unknown. Exploration is particularly important in <strong>dynamic or stochastic environments<\/strong>, where conditions may change over time and the agent must continually adapt.<\/p>
<p>The challenge with exploration is that it can involve <strong>temporary sacrifices in reward<\/strong>. Choosing unknown actions may lead to low or negative rewards in the short term, making the agent\u2019s performance appear worse initially. Thus, exploration introduces <strong>risk<\/strong> but is necessary to achieve higher cumulative rewards in the long run.<\/p>
<h3>Balancing Exploration and Exploitation<\/h3>
<p>Effective reinforcement learning requires a <strong>careful balance<\/strong> between exploration and exploitation. Too much exploitation may prevent the agent from discovering better strategies, while excessive exploration may waste time and resources without sufficient reward.<\/p>
<p>Several strategies are commonly used to manage this trade-off:<\/p>
<ol>
<li><strong>\u03b5-Greedy Strategy<\/strong>: With probability \\(\\epsilon\\), the agent explores randomly, and with probability \\(1-\\epsilon\\), it exploits the best-known action. Gradually decreasing \\(\\epsilon\\) over time allows the agent to explore early and exploit later (see the sketch after this list).<\/li>
<li><strong>Softmax Action Selection<\/strong>: The agent selects actions probabilistically based on their estimated value, giving higher-value actions more weight but still allowing lower-value actions to be tried occasionally.<\/li>
<li><strong>Upper Confidence Bound (UCB)<\/strong>: Balances exploration and exploitation by considering both the estimated value of an action and the uncertainty associated with it. Actions with higher uncertainty are given a chance to be explored.<\/li>
<li><strong>Thompson Sampling<\/strong>: Uses a Bayesian approach to select actions according to their probability of being optimal, naturally balancing exploration and exploitation.<\/li>
<\/ol>
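<p>As a concrete illustration of the first strategy in the list above, the sketch below implements \u03b5-greedy action selection together with a simple linear decay schedule for \u03b5; the tabular <code>Q<\/code> array and the schedule parameters are illustrative assumptions.<\/p>
<pre><code>import numpy as np

def epsilon_greedy(Q, state, epsilon):
    # Explore with probability epsilon, otherwise exploit the best-known action.
    if np.random.rand() &lt; epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    # Linear schedule: explore heavily at first, then shift toward exploitation.
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
<\/code><\/pre>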
<h3>Importance in Reinforcement Learning<\/h3>
<p>The exploration-exploitation trade-off is crucial because RL is fundamentally about <strong>learning from interaction<\/strong>. Optimal policies cannot be learned if the agent never explores unknown states, and immediate rewards cannot be maximized if the agent never exploits its current knowledge.<\/p>
<p>In complex environments, like robotics, autonomous driving, or strategy games, balancing exploration and exploitation affects <strong>learning efficiency, convergence speed, and overall performance<\/strong>. Algorithms that handle this trade-off well, such as \u03b5-greedy Q-learning or PPO with stochastic policies, are more robust and capable of achieving near-optimal long-term behavior.<\/p>
<h2>Reward Systems and Policy Optimization<\/h2>
<p>Reward systems are foundational to decision-making frameworks in both artificial intelligence (AI) and organizational management. At their core, reward systems define what behaviors are desirable by providing positive or negative reinforcement, shaping future actions. In AI, particularly in reinforcement learning (RL), reward systems are critical for guiding agents toward optimal policies\u2014sequences of actions that maximize cumulative rewards. In organizational contexts, reward systems similarly influence human behavior, affecting productivity, innovation, and long-term strategic alignment.<\/p>
<h3>The Role of Reward Systems in Decision-Making<\/h3>
<p>A reward system serves as the primary feedback mechanism in both human and artificial agents. In reinforcement learning, an agent interacts with an environment by taking actions that transition it from one state to another. After each action, the agent receives a reward signal\u2014a numerical value representing the immediate benefit or cost associated with that action. The overarching goal is to maximize the expected cumulative reward over time.
This process mirrors human and organizational behavior, where incentives such as promotions, bonuses, or recognition encourage actions aligned with strategic objectives.<\/p>\n<p data-start=\"1401\" data-end=\"1907\">Reward systems must be carefully designed to avoid unintended consequences. For example, overly simplistic reward structures may encourage short-term gains at the expense of long-term outcomes, a phenomenon often referred to as &#8220;reward hacking.&#8221; In both AI and organizational settings, misaligned rewards can lead to behavior that maximizes the reward metric while undermining broader objectives. Thus, understanding the underlying incentives and ensuring they are aligned with desired outcomes is crucial.<\/p>\n<h3 data-start=\"1909\" data-end=\"1958\">Policy Optimization in Reinforcement Learning<\/h3>\n<p data-start=\"1960\" data-end=\"2338\">Policy optimization is the process of finding the best policy that maximizes expected rewards. A policy, in reinforcement learning terminology, defines the agent\u2019s strategy: a mapping from observed states of the environment to actions. The quality of a policy is measured by its expected return, which is the cumulative sum of discounted rewards the agent anticipates receiving.<\/p>\n<p data-start=\"2340\" data-end=\"3004\">There are two main approaches to policy optimization: <strong data-start=\"2394\" data-end=\"2417\">value-based methods<\/strong> and <strong data-start=\"2422\" data-end=\"2446\">policy-based methods<\/strong>. Value-based methods, such as Q-learning, involve estimating the value function, which predicts the expected return for taking a particular action in a given state. Once the value function is learned, the optimal policy is derived by selecting actions that maximize expected value. Policy-based methods, by contrast, directly parameterize the policy and adjust it iteratively to improve performance. Techniques such as <strong data-start=\"2866\" data-end=\"2893\">policy gradient methods<\/strong> allow for continuous action spaces and more nuanced strategies, making them suitable for complex environments.<\/p>\n<p data-start=\"3006\" data-end=\"3446\">Hybrid approaches, such as <strong data-start=\"3033\" data-end=\"3057\">actor-critic methods<\/strong>, combine value-based and policy-based strategies. The &#8220;actor&#8221; updates the policy, while the &#8220;critic&#8221; evaluates the actions using a value function, providing more stable and efficient learning. These methods highlight the interplay between reward systems and policy optimization: the structure of the reward signal directly influences how efficiently the agent can learn an optimal policy.<\/p>\n<h2 data-start=\"98\" data-end=\"139\">Applications of Reinforcement Learning<\/h2>\n<p data-start=\"141\" data-end=\"850\">Reinforcement Learning (RL) is a powerful subset of machine learning where agents learn to make sequential decisions by interacting with an environment to maximize cumulative rewards. Unlike supervised learning, which relies on labeled datasets, RL agents learn through trial and error, receiving feedback from their actions. Over the past decade, RL has evolved from a theoretical framework to a practical tool, finding applications across diverse domains, including robotics, healthcare, finance, gaming, natural language processing, and autonomous systems. 
This exploration highlights key applications, illustrating how RL\u2019s ability to optimize sequential decisions is transforming technology and industry.<\/p>\n<h3 data-start=\"852\" data-end=\"890\">1. Robotics and Autonomous Systems<\/h3>\n<p data-start=\"892\" data-end=\"1382\">One of the most natural applications of RL is in robotics, where agents must make real-time decisions in dynamic and uncertain environments. RL enables robots to learn complex motor skills through trial-and-error interactions. For example, robotic arms can learn to grasp and manipulate objects of varying shapes and sizes without explicit programming. Through reward signals, robots adapt to succeed in tasks such as stacking blocks, assembling parts, or performing surgery with precision.<\/p>\n<p data-start=\"1384\" data-end=\"1888\">Autonomous vehicles, including self-driving cars and drones, also leverage RL for navigation, route optimization, and obstacle avoidance. For instance, RL algorithms allow vehicles to learn optimal driving policies by simulating millions of driving scenarios, balancing safety, efficiency, and passenger comfort. Techniques such as <strong data-start=\"1716\" data-end=\"1741\">Deep Q-Networks (DQN)<\/strong> and <strong data-start=\"1746\" data-end=\"1784\">Proximal Policy Optimization (PPO)<\/strong> are commonly used to train these systems, enabling real-world deployment in complex urban environments.<\/p>\n<h3 data-start=\"1890\" data-end=\"1921\">2. Gaming and Entertainment<\/h3>\n<p data-start=\"1923\" data-end=\"2192\">Gaming has historically been a benchmark for RL research, providing a controlled yet complex environment for testing algorithms. RL has achieved superhuman performance in games ranging from classic board games like chess and Go to modern video games like StarCraft II.<\/p>\n<p data-start=\"2194\" data-end=\"2602\">A famous example is <strong data-start=\"2214\" data-end=\"2225\">AlphaGo<\/strong>, developed by DeepMind, which combined RL with deep neural networks to defeat human world champions in Go\u2014a game previously considered intractable for machines due to its immense search space. Similarly, RL has been used in video games to create adaptive AI opponents that adjust strategies dynamically, providing challenging experiences for players and enhancing game design.<\/p>\n<p data-start=\"2604\" data-end=\"2818\">Beyond competitive games, RL contributes to interactive entertainment, such as personalized game content, adaptive difficulty levels, and recommendation systems for users, optimizing engagement and user experience.<\/p>\n<h3 data-start=\"2820\" data-end=\"2859\">3. Healthcare and Medical Treatment<\/h3>\n<p data-start=\"2861\" data-end=\"3357\">Reinforcement learning has significant potential in healthcare, particularly in personalized treatment planning and decision support. Medical decision-making often involves sequential choices under uncertainty, making RL an ideal framework. For instance, RL can help optimize chemotherapy schedules, balancing the efficacy of treatment with minimizing side effects. By simulating patient responses over time, RL agents can suggest individualized treatment policies that maximize patient outcomes.<\/p>\n<p data-start=\"3359\" data-end=\"3848\">RL also contributes to adaptive interventions in chronic disease management, such as insulin dosing for diabetes or rehabilitation protocols after surgery. 
Clinical decision-support systems use RL to recommend treatments that improve long-term health outcomes rather than just immediate symptom relief. Moreover, RL can assist in drug discovery by guiding molecular design toward compounds with desired therapeutic properties, effectively accelerating the research and development process.<\/p>\n<h3 data-start=\"3850\" data-end=\"3876\">4. Finance and Trading<\/h3>\n<p data-start=\"3878\" data-end=\"4339\">Financial markets are inherently dynamic, uncertain, and sequential in nature, making them a suitable domain for RL applications. RL algorithms can optimize trading strategies by continuously learning from market data to maximize returns while controlling risk. For example, RL agents can determine optimal asset allocations, manage portfolios, and perform algorithmic trading by modeling market conditions as states and actions as buy, sell, or hold decisions.<\/p>\n<p data-start=\"4341\" data-end=\"4670\">Risk management and fraud detection also benefit from RL, where agents learn to predict and mitigate financial losses under uncertainty. Techniques like <strong data-start=\"4494\" data-end=\"4539\">Deep Deterministic Policy Gradient (DDPG)<\/strong> and <strong data-start=\"4544\" data-end=\"4568\">actor-critic methods<\/strong> are commonly applied to manage high-dimensional continuous action spaces inherent in trading systems.<\/p>\n<h3 data-start=\"4672\" data-end=\"4728\">5. Natural Language Processing and Conversational AI<\/h3>\n<p data-start=\"4730\" data-end=\"5052\">Reinforcement learning plays an increasingly important role in natural language processing (NLP), especially in developing conversational AI and language models. In these applications, RL optimizes interactions based on human feedback, ensuring that responses are contextually relevant, safe, and aligned with user goals.<\/p>\n<p data-start=\"5054\" data-end=\"5500\">For example, RL with human feedback (RLHF) has been critical in fine-tuning large language models, guiding the model to generate more helpful, factual, and coherent outputs. Chatbots, virtual assistants, and recommendation systems also use RL to optimize dialogue strategies, learning from user engagement and satisfaction metrics. By framing conversation as a sequential decision-making problem, RL ensures agents improve interactions over time.<\/p>\n<h3 data-start=\"5502\" data-end=\"5539\">6. Energy Systems and Smart Grids<\/h3>\n<p data-start=\"5541\" data-end=\"5908\">Reinforcement learning is increasingly applied in energy management and sustainability efforts. Smart grids, which integrate renewable energy sources, require dynamic control of energy distribution and consumption. RL algorithms optimize energy storage, load balancing, and demand response strategies, reducing operational costs while minimizing environmental impact.<\/p>\n<p data-start=\"5910\" data-end=\"6366\">For instance, RL can learn to schedule battery charging and discharging in response to fluctuating electricity prices or renewable energy availability. Similarly, RL helps optimize heating, ventilation, and air conditioning (HVAC) systems in buildings, reducing energy consumption while maintaining occupant comfort. These applications demonstrate RL\u2019s capability to handle complex, dynamic, and multi-objective optimization problems in real-world systems.<\/p>\n<h3 data-start=\"6368\" data-end=\"6426\">7. 
Industrial Automation and Supply Chain Optimization<\/h3>\n<p data-start=\"6428\" data-end=\"6757\">Industrial operations and supply chain management are inherently sequential decision problems, making them ideal for RL. In manufacturing, RL can optimize production schedules, machine maintenance, and inventory management. By learning from operational data, RL agents can minimize downtime, reduce waste, and improve throughput.<\/p>\n<p data-start=\"6759\" data-end=\"7178\">In logistics, RL enhances route optimization for delivery vehicles, warehouse management, and inventory replenishment. For example, agents can dynamically adapt delivery routes based on traffic conditions and order priorities, minimizing costs and improving customer satisfaction. These applications highlight RL\u2019s ability to optimize complex operational systems where decisions at one stage affect downstream outcomes.<\/p>\n<h3 data-start=\"7180\" data-end=\"7229\">8. Personalized Recommendations and Marketing<\/h3>\n<p data-start=\"7231\" data-end=\"7640\">Reinforcement learning enhances personalization in e-commerce, streaming platforms, and digital advertising. Traditional recommendation systems often rely on historical preferences, but RL enables dynamic adaptation to user behavior over time. By modeling user interactions as sequential decisions, RL agents learn to suggest products, content, or advertisements that maximize long-term engagement or revenue.<\/p>\n<p data-start=\"7642\" data-end=\"8057\">For instance, streaming platforms use RL to optimize content recommendations, balancing user satisfaction with retention metrics. In advertising, RL determines the most effective ad placements, bid strategies, and targeting approaches, maximizing the return on investment while minimizing user fatigue. The sequential and adaptive nature of RL allows these systems to continuously improve as more data is collected.<\/p>\n<h3 data-start=\"8059\" data-end=\"8110\">9. Autonomous Research and Scientific Discovery<\/h3>\n<p data-start=\"8112\" data-end=\"8451\">Reinforcement learning is also applied in domains requiring exploration of complex solution spaces. In scientific research, RL assists in designing experiments, optimizing chemical reactions, or discovering new materials. By simulating outcomes and learning from iterative trials, RL reduces the cost and time required for experimentation.<\/p>\n<p data-start=\"8453\" data-end=\"8754\">In computational biology, RL can guide protein folding predictions, genetic engineering strategies, or synthetic biology applications. Similarly, in environmental science, RL helps optimize resource allocation for conservation efforts, managing ecosystems, or predicting climate-related interventions.<\/p>\n<h3 data-start=\"9630\" data-end=\"9644\">Conclusion<\/h3>\n<p data-start=\"9646\" data-end=\"10299\">Reinforcement learning has evolved from a theoretical concept into a versatile tool capable of solving complex, sequential decision-making problems across industries. Its applications span robotics, gaming, healthcare, finance, energy management, industrial automation, personalized recommendations, and scientific discovery. By enabling agents to learn from experience and optimize behavior over time, RL continues to reshape how systems adapt, make decisions, and improve performance. 
As research advances and computational resources grow, reinforcement learning is poised to play an even more transformative role in technology, industry, and society.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Reinforcement Learning (RL) is a dynamic area of machine learning that focuses on how agents should take actions in an environment to maximize cumulative rewards. Unlike supervised learning, where models are trained on labeled datasets, RL involves learning from interaction, trial-and-error, and delayed feedback. It draws inspiration from behavioral psychology, where learning is driven [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-7514","post","type-post","status-publish","format-standard","hentry","category-technical-how-to"],"_links":{"self":[{"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/posts\/7514","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/comments?post=7514"}],"version-history":[{"count":1,"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/posts\/7514\/revisions"}],"predecessor-version":[{"id":7515,"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/posts\/7514\/revisions\/7515"}],"wp:attachment":[{"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/media?parent=7514"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/categories?post=7514"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lite16.com\/blog\/wp-json\/wp\/v2\/tags?post=7514"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}