Markov, Monte Carlo, TD: the Roots Under RL

You keep forgetting these because they are taught as dry definitions with no picture and no thread. Here they are as one story, each idea a visible shape you can picture and reason from, finally seen living inside every RL algorithm you already know.

Markov, Monte Carlo, TD
01

Why These Refuse to Stick

Markov property, Monte Carlo, temporal difference. You have read them five times and they evaporate. The reason is not you. It is that they are usually presented as three disconnected definitions, each in its own notation, with no anchor to anything you can see. Definitions without pictures do not survive.

So we will do the opposite. One running world, drawn the same way every time. One question asked of it: how good is each situation? Three answers to that one question, which turn out to be three points on a single dial. By the end, "Markov", "Monte Carlo", and "TD" will not be vocabulary. They will be three things you can sketch on a napkin and reason about from the picture alone.

The One Thread Through Everything
The spine of this blog. The Markov property is the assumption that makes the future computable. A Markov chain is a world that obeys it. An MDP adds choices and rewards. Then two ways to learn how good things are: Monte Carlo waits for the whole story, TD updates after one step. They are the two ends of one dial, and every RL algorithm you know sits somewhere on it.
02

The Markov Property: the Present Is Enough

Picture a board game where your next move depends only on the square you are standing on, not on the path you took to get there. It does not matter whether you arrived by a long detour or a lucky shortcut. Standing on "GO TO JAIL" sends you to jail regardless of your history. That is the Markov property: the present square holds everything you need to know the future.

Stated precisely, a process is Markov if the future is conditionally independent of the past given the present:

\mathbb{P}\big(S_{t+1} \mid S_t, S_{t-1}, \dots, S_0\big) \;=\; \mathbb{P}\big(S_{t+1} \mid S_t\big)
The entire history collapses into the current state. The state is a sufficient statistic for the future.

The phrase that unlocks it: the state is a sufficient statistic for the future. If knowing where you are tells you everything that the full history would have told you about what comes next, the property holds. This is not a law of nature, it is a property of how you define the state. Choose the state badly and the property breaks; enrich the state and it returns.

History Erased: Two Paths, One Future
Two travelers reach the same square by completely different routes. Under the Markov property their futures are drawn from the identical distribution: the square forgot how they arrived. The greyed out trails are the discarded history. Everything the future needs is printed on the current square.
BUT WAIT Real life clearly depends on history. Stock prices depend on trends, a conversation depends on what was said. So isn't the Markov property just false for everything interesting?

The property is not false, it is a contract you satisfy by choosing the state correctly. The trick is to fold whatever history matters into the present.

Stock price depends on the trend? Then a single price is a bad state. Make the state the price plus recent velocity and volatility, and now the future depends only on that enriched present: Markov restored. A conversation depends on what was said? A single last word is a bad state. Make the state the full dialogue so far, and the next word depends only on that present. This is exactly what a transformer does: its context window IS the Markov state, which is why autoregressive generation is a Markov chain over sequences even though language is full of long range dependence.

The deep lesson, and the one interviewers want: Markov is a property of the state representation, not of the world. Any process can be made Markov by enriching the state enough, in the limit by stuffing the entire history into it. The art of RL and sequence modeling is choosing a state rich enough to be Markov but small enough to learn from. When the true state is hidden and you only see partial clues, you have a POMDP, and the standard fix is to rebuild an approximate state from history, with an RNN, a transformer, or a belief distribution.

03

Markov Chains: a World That Obeys the Rule

A Markov chain is just a set of states and the probabilities of hopping between them, drawn as arrows. Our running world for the whole blog, a cartoon of a student's day:

The Running World: a Student's Day as a Chain
Five states. From Study you go to Pass with probability 0.7 or slip to Distracted with 0.3. From Distracted you recover to Study (0.5) or drift to Phone (0.5). Sleep is terminal, the day ends. Each arrow is a transition probability; every arrow leaving a state sums to 1. This is the whole chain: states plus arrows, nothing hidden, the present square is always enough.

Two questions a chain answers, both by pure bookkeeping. Where will I likely be in k steps? Multiply the transition matrix by itself k times. Where do I end up in the long run? The stationary distribution, the fixed point where the flow in equals the flow out, the same eigenvector idea behind PageRank ranking web pages by a random surfer's long run visit frequency. A chain has no rewards and no choices yet. It is the empty stage. Next we put an actor on it.

04

Adding Choices and Rewards: the MDP

A Markov chain is a world that happens to you. A Markov Decision Process is a world you act in. Three additions turn the empty stage into the setting of all RL:

Actions. In each state you choose, and your choice bends the transition probabilities. Rewards. Each transition pays a number. A policy π. Your strategy: which action to take in each state. The chain's fixed arrows become a menu of arrows you select among.

\text{MDP} = \big(\,\mathcal{S},\ \mathcal{A},\ P(s' \mid s, a),\ R(s, a),\ \gamma\,\big)
States, actions, transition probabilities now conditioned on the action, rewards, and the discount γ. Fix a policy π and the MDP collapses back into a plain Markov chain, the one your strategy induces.

That last sentence is the bridge nobody points out: a policy turns an MDP back into a Markov chain. Choose how you act, and the menu of arrows becomes a single set of arrows again, with rewards attached. Evaluating "how good is my strategy" is therefore a question about a Markov chain with payouts. That is precisely the question Monte Carlo and TD answer.

From Chain to MDP: Arrows Become Choices
The student world upgraded. In Study you now choose: Grind (high chance of Pass, small reward now) or Break (pleasant now, risk of sliding to Distracted). Each action carries its own arrows and its own reward number. A policy picks one action per state, which selects one arrow bundle, which is again a Markov chain, now with rewards on every edge. The whole apparatus of RL is built to find the policy whose induced chain pays the most.
05

The Quantity Everyone Wants: the Value

One question rules them all: how good is it to be in state s under policy π? Good means how much total reward you expect to collect from here to the end. The total, discounted, is the return:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots,\qquad V^\pi(s) = \mathbb{E}_\pi\big[\,G_t \mid S_t = s\,\big]
G is one rollout's total discounted reward, a random number. V is its expectation, the true value. This is the exact G_t and V from your PPO and Q-learning blogs.

The problem: V is an expectation, an average over all the random futures the chain could produce. We cannot see expectations. We can only see samples, individual days that actually happened. The entire subject of this blog is the two ways to climb from samples we can see to the expectation we want. Hold a concrete number in mind: suppose from Study the true value is V(Study) = 6.0. Neither method knows that. Both are trying to discover it from experience.

Value Is the Average Over Every Possible Day
From Study, three different days unfold by chance, each with its own total return: 8, 6, and 4. The value V(Study) is their probability weighted average, here 6.0. The catch that defines the field: you never observe the average, only one day at a time. Monte Carlo and TD are two strategies for recovering the hidden average from visible days.
06

Monte Carlo: Live the Whole Day, Then Average

Want to know if a restaurant is good? Eat there many times, write down how much you enjoyed each full visit, average the scores. Do not guess halfway through the meal. Wait until you walk out, then record the total. That is the whole of Monte Carlo.

Formally, Monte Carlo estimates a value by averaging complete returns. Play an episode to termination, compute its actual G, and nudge the estimate toward it:

V(s) \;\leftarrow\; V(s) + \alpha\,\big[\,G_t - V(s)\,\big]
G_t is the real, observed total return of this episode. No model, no bootstrapping, just "pull the estimate toward what actually happened".

Run it on the world. The estimate starts at V(Study) = 0. Three full days are lived to Sleep, with observed total returns 8, then 6, then 4. With α = 0.5: after day one V = 0 + 0.5(8 − 0) = 4.0. After day two V = 4 + 0.5(6 − 4) = 5.0. After day three V = 5 + 0.5(4 − 5) = 4.5, circling the true 6.0, and a running average would land on it exactly.

The character of Monte Carlo, the part interviewers want: unbiased but high variance. Each G is a true sample of the return, so the average is correct in expectation, no systematic error. But one day's total swings wildly because it accumulates every random coin flip along the entire episode. And it has a hard requirement: the episode must end. No ending, no return, no update. You cannot Monte Carlo an endless process.

Monte Carlo: Wait for the Ending, Then Learn
Three complete days from Study to Sleep, drawn end to end. Only when each day finishes does its total return (8, 6, 4) flow back to update V(Study). The updates are correct on average but jumpy, because a single day's total carries the noise of every coin flip in it. Long episodes mean long waits and wide swings. That waiting and that swing are exactly what TD attacks next.
07

Temporal Difference: Don't Wait, Guess From the Next Guess

You are driving home and you estimated 30 minutes. Ten minutes in, you hit traffic the moment you merge. You do not wait until you arrive to learn, you update right now: "this looks more like 40 minutes". You corrected your guess using a newer guess, before the trip ended. That is temporal difference learning.

Formally, TD replaces the full return G with a one step estimate: the reward you just got plus the discounted value of where you landed. Update toward that:

V(s) \;\leftarrow\; V(s) + \alpha\,\big[\,\underbrace{R + \gamma V(s')}_{\text{TD target}} - V(s)\,\big],\qquad \underbrace{\delta = R + \gamma V(s') - V(s)}_{\text{the TD error}}
The target uses the current estimate V(s′) in place of the unknown future. Learning from a guess to improve a guess is bootstrapping.

Run it on the world. V(Study) = 0, V(Pass) = 8 currently. One step happens: from Study you go to Pass, collecting reward R = 0, with γ = 0.9. The TD target is 0 + 0.9 × 8 = 7.2. The error is 7.2 − 0 = 7.2. With α = 0.5, V(Study) ← 0 + 0.5 × 7.2 = 3.6, learned from one step, no need to reach Sleep. The value of Pass seeped one state backward into Study immediately.

The character of TD, the mirror image of MC: biased but low variance. It leans on V(s′), which is currently wrong, so early estimates are systematically off, that is the bias. But it only absorbs one step of randomness per update instead of a whole episode's worth, so it is far steadier. And it needs no ending: TD learns online, step by step, in endless processes. That single property is why almost all of deep RL is built on TD, not MC.

TD: Learn From the Very Next Step
One transition, Study to Pass. The TD target R + γV(Pass) = 7.2 is built from the immediate reward plus the existing estimate of the next state. The error 7.2 flows back into V(Study) the instant the step completes, no waiting for Sleep. Value information walks backward one state per step, the same backward seepage you saw drive Q-learning. Bias because V(Pass) is itself still a guess; low variance because only one coin flip entered.
BUT WAIT TD updates a guess using another guess. At the start every value is wrong, so it is wrong numbers correcting wrong numbers. Why does this converge to the truth instead of spiraling into nonsense?

You asked almost this exact question in the Q-learning blog, and the answer is the same machinery, worth seeing in this purer setting.

The save is that each TD target is not pure guess. It contains one piece of hard reality: the actual reward R the environment just paid. Every update is one part real reward, γ parts current belief. Truth physically enters the system at every step through R, and the bootstrap spreads it.

Watch it on the world. At the very start all values are 0, so most TD targets are 0 + 0.9 × 0 = 0 and nothing moves, except at the step into a rewarding terminal, where the target is the real reward with no guess attached. That genuine number anchors the state next to the reward. The following step has a target built partly from that now correct neighbor, and so truth ripples outward from the rewards, one state per sweep. The wrongness washes out from the anchors, it does not amplify.

The formal guarantee: the Bellman expectation operator is a γ contraction, every sweep pulls the value estimates at least a factor γ closer to the unique true V, so errors shrink geometrically. With a decaying learning rate and all states kept visited, tabular TD provably converges to the exact value. The honest footnote, identical to Q-learning: this proof is for tables. Bolt on a neural network for V and the contraction guarantee is gone, which is the deadly triad and the reason target networks exist.

08

One Dial: MC and TD Are the Two Ends

Here is the unification that makes it all click. MC waits for the full return. TD looks one step ahead. What about two steps? Three? n steps? They are all valid, and they form a continuous dial called n step TD:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
Take n real reward steps, then bootstrap with V for the rest. n = 1 is TD. n = ∞ (run to the end) is Monte Carlo. Everything between is a blend.

The dial is a bias variance trade. Turn toward TD (small n): low variance, high bias, fast online updates. Turn toward MC (large n): high variance, low bias, must wait. The sweet spot is usually in the middle, and there is an elegant way to average over all n at once with a decay factor λ, giving TD(λ) and eligibility traces, a single knob λ sliding smoothly from TD(0) = one step TD to TD(1) = Monte Carlo.

The Bias Variance Dial, Drawn
The single dial under everything. Far left, TD(0): one real reward then bootstrap, steady but biased. Far right, Monte Carlo: all real rewards, no bootstrap, unbiased but jumpy. n step methods are the labeled notches between. The lower panel shows the trade as two crossing curves: as n grows, variance climbs and bias falls. Most strong methods, including the GAE in your PPO blog, live deliberately in the middle of this dial.
09

Now Watch Them Inside the RL You Know

These are not separate topics from RL. They are its skeleton. Every algorithm in your other blogs is one of these primitives wearing a costume.

What you knowIs reallyThe connection
The state in any RL problema Markov statechosen rich enough that the future depends only on it; a transformer's context window is one
An environment + your policyan MDP collapsed to a chainfixing π turns the action menu into one reward bearing Markov chain
REINFORCE returnsMonte Carlofull episode returns, unbiased, high variance, the reason REINFORCE is noisy
PPO / Q-learning criticTDV and Q trained by the TD update of §7; the δ you bootstrap is the TD error
GAE in PPOTD(λ)exactly the n step dial of §8, λ averaging trades the critic's bias against the return's variance
The Q-learning maxTD + controlTD applied to Q with a max in the target, learning the best policy off policy
DQN target networkstabilized bootstrappingfreezing V(s′) in the TD target so the guess driving the guess holds still

Read that table top to bottom and the whole field reorganizes. Markov gives you a state worth reasoning about. The MDP gives you choices and rewards on top of it. Value is the thing you want. Monte Carlo and TD are the two ways to estimate it from experience, and n step / TD(λ) is the dial between them. Policy gradient methods like PPO use these values to reduce the variance of their gradients; value methods like Q-learning use them as the whole answer. There is nothing else underneath. These four ideas are the foundation.

The one paragraph summary

The Markov property says the present state holds all you need to predict the future, which is true whenever you define the state richly enough. A Markov chain is a world obeying it; an MDP adds actions, rewards, and a policy, and fixing the policy turns it back into a chain with payouts. The thing we want is the value, the expected total reward from a state, which we can never see directly, only sample. Monte Carlo estimates it by living whole episodes and averaging their returns: unbiased, high variance, needs an ending. TD estimates it after a single step by trusting its own next guess plus the real reward just received: biased, low variance, works online forever. n step TD and TD(λ) are the dial between them, and every RL algorithm you know, REINFORCE, PPO, GAE, Q-learning, DQN, is one of these primitives in a costume.