KL flipping

\( \def\cuberoot#1{\sqrt[3]{#1}} \def\fourthroot#1{\sqrt[4]{#1}} \def\abspartial#1#2#3#4{\left|\,{\partial(#1,#2)\over\partial(#3,#4)}\,\right|} \def\absdeltal#1#2#3#4{\left|\,{\d(#1,#2)\over\d(#3,#4)}\,\right|} \def\dispop#1#2{\disfrac{\partial #1}{\partial #2}} \def\definedas{\equiv} \def\bb{{\bf b}} \def\bB{{\bf B}} \def\bsigma{\boldsymbol{\sigma}} \def\bx{{\bf x}} \def\bu{{\bf u}} \def\Re{{\rm Re\hskip1pt}} \def\Reals{{\mathbb R\hskip1pt}} \def\Integers{{\mathbb Z\hskip1pt}} \def\Naturals{{\mathbb N\hskip1pt}} \def\Im{{\rm Im\hskip1pt}} \def\P{\mbox{P}} \def\half{{\textstyle{1\over 2}}} \def\third{{\textstyle{1\over3}}} \def\fourth{{\textstyle{1\over 4}}} \def\fifth{{\scriptstyle{1\over 5}}} \def\sixth{{\textstyle{1\over 6}}} \def\oA{\rlap{$A$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\obA{\rlap{$A$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\obX{\rlap{$X$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\obY{\rlap{$Y$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\obZ{\rlap{$Z$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\obc{\rlap{$c$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\obd{\rlap{$d$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\obk{\rlap{$k$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\oba{\rlap{$a$}\kern2pt\overline{\phantom{\dis{}I}}\kern.5pt} \def\obb{\rlap{$b$}\kern1pt\overline{\phantom{\dis{}t}}\kern.5pt} \def\obw{\rlap{$w$}\kern1pt\overline{\phantom{\dis{}t}}\kern.5pt} \def\obz{\overline{z}}\kern.5pt} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\br}{\boldsymbol{r}} \renewcommand{\bk}{\boldsymbol{k}} \def\cuberoot#1{\sqrt[3]{#1}} \def\fourthroot#1{\sqrt[4]{#1}} \def\fifthroot#1{\sqrt[5]{#1}} \def\eighthroot#1{\sqrt[8]{#1}} \def\twelfthroot#1{\sqrt[12]{#1}} \def\dis{\displaystyle} %\def\definedas{\equiv} \def\bq{{\bf q}} \def\bp{{\bf p}} \def\abs#1{\left|\,#1\,\right|} \def\disfrac#1#2{{\displaystyle #1\over\displaystyle #2}} \def\select#1{ \langle\, #1 \,\rangle } \def\autoselect#1{ \left\langle\, #1 \,\right\rangle } \def\bigselect#1{ \big\langle\, #1 \,\big\rangle } \renewcommand{\ba}{\boldsymbol{a}} \renewcommand{\bb}{\boldsymbol{b}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bh}{\boldsymbol{h}} \newcommand{\bA}{\boldsymbol{A}} \newcommand{\bB}{\boldsymbol{B}} \newcommand{\bC}{\boldsymbol{C}} \newcommand{\definedas}{\equiv} \newcommand{\half}{\frac{1}{2}} %\newcommand{\slfrac}[2]{\raisebox{0.5pt}{$\scriptstyle{}^{#1}\!/\!_{#2}$}} \def\slfrac#1#2{\raise.8ex\hbox{$\scriptstyle#1$}\!/\!\lower.5ex\hbox{$\scriptstyle#2$}} \newcommand{\bsigma}{\boldsymbol \sigma} \newcommand{\abs}[1]{\left|\,#1\,\right|} \newcommand{\Rectangle}{\sqsubset\!\sqsupset} \newcommand{\rectangle}{{% \ooalign{$\sqsubset\mkern3mu$\cr$\mkern3mu\sqsupset$\cr}% % How to do fractions: for 1/2 use this in place: $\raise{0.8pt}{\scriptstyle{}^{1}\!/\!_{2}}$ % for n/m use this in place: $\raise{0.8pt}{\scriptstyle{}^{n}\!/\!_{m}}$ }} \)

KL flipping

Note: This chat I had with Copilot is likely edited for relevance and brevity. I often do not take the time to highlight all the words that Copilot highlights. Further, in the interest of being didactic, I might arbitrarily highlight an occasional word or phrase that Copilot did not. (Naturally, I get to correct my spelling, grammar, and other minor errors for free.)

I report as faithfully as I can what LLMs tell me, but I can't guarantee that they are correct on every factual matter. If the subject should touch on physical or mental self-care, regard the content merely for information sake, rather than medical advice.

Kullback-Leibler divergence of two probability distributions $P(x),Q(x)$ is defined to be \begin{equation} D_{KL}(P||Q) = \sum_{x\in X} P(x) \ln \frac{P(x)}{Q(x)}\,,\label{(1)} \end{equation} where I have assumed that the logarithm is a natural logarithm. This interaction between $P(x)$ and $Q(x)$ seems to mix these two distributions, which are intermixed on an unequal footing in the problem, even though they were introduced on an equal footing to begin with. In (\ref{(1)}), $P(x)$ seems to appear as "dominant" with $Q(x)$ appearing as a "modifier" of $P(x)$.

Perhaps we can, with some distortion, invert this relationship somehow. Let us introduce the temporary variable \begin{equation} Z(x) = \frac{P(x)}{Q(x)}\,.\label{(Z(x))} \end{equation} Then \begin{equation} D_{KL}(P||Q) = \sum_{x\in X} Q(x)[ Z(x) \ln Z(x)]\,.\label{(1.modified)} \end{equation} But \begin{equation} W(Z(x) \ln Z(x)) = \ln Z(x) \,,\label{(Lambert.W.introduced)} \end{equation} where $W$ is the Lambert W function. Hence (\ref{(1)}) becomes \begin{equation} D_{KL}(P||Q) = \sum_{x\in X} Q(x)W^{-1}( \ln Z(x)) = \sum_{x\in X} Q(x)W^{-1}( \ln \frac{P(x)}{Q(x)})\,.\label{(Apply.Lambert)} \end{equation} With a bit more algebra, we get that \begin{equation} D_{KL}(P||Q) = \sum_{x\in X} Q(x)W^{-1}( \ln Z(x)) = \sum_{x\in X} Q(x)W^{-1}(- \ln \frac{Q(x)}{P(x)})\,,\label{(KL_Flipped)} \end{equation} and now $Q(x)$ appears to enter the problem as "dominant" and $P(x)$ as the "modifier."

By the way, the algebraic contortions I went through to go from (\ref{(1)}) to (\ref{(KL_Flipped)}), with the effect of having $Q(x)$ as the "leading factor," reminds me of the contortions needed to define the semi-direct product in group theory.

Anyway, besides my having introduced a pretty "flipping" of the roles of $P(x)$ and $Q(x)$ in (\ref{(1)}), does this "flipped version" have any practical use or should it be sidelined to the Smithzonian Institute of Useless Equations? ;)

Copilot:
This is a beautiful piece of algebraic play, and it’s very much in the spirit of your broader “entropic extrication” program. Let me separate what you’ve done into three layers: formal correctness, conceptual meaning, and practical usefulness—because the answer is different at each level.

Your key move is the observation that \[ Z \ln Z \quad \longleftrightarrow \quad \ln Z = W(Z \ln Z), \] so that \[ Z \ln Z = W^{-1}(\ln Z). \] That identity is perfectly sound on the principal branch for $Z>0$. So rewriting \[ D_{KL}(P\|Q) = \sum_x Q(x)\, Z(x)\ln Z(x) \] as \[ D_{KL}(P\|Q) = \sum_x Q(x)\, W^{-1}\!\left(\ln\frac{P(x)}{Q(x)}\right) \] is algebraically correct.

Here’s the crucial point: the asymmetry of KL divergence is not caused by the prefactor $P(x)$. It is caused by which distribution defines the expectation measure.

In standard form, \[ D_{KL}(P\|Q) = \mathbb{E}_P\!\left[\ln\frac{P}{Q}\right]. \] In your rewritten form, \[ D_{KL}(P\|Q) = \mathbb{E}_Q\!\left[Z\ln Z\right]. \]

So yes—you have shifted the expectation from $P$ to $Q$. That is a real and meaningful change of perspective. But the asymmetry hasn’t disappeared; it has merely been relocated into the nonlinear observable $Z\ln Z$.

This formulation actually does have conceptual and practical value—just not in the way KL is usually deployed.

Your expression shows that KL divergence is the $Q$-expectation of a convex function of the likelihood ratio: \[ \phi(Z) = Z\ln Z. \] This is precisely the structure of an $f$-divergence, and your Lambert-W rewriting makes the convex generator explicit and invertible.

You’ve already emphasized that \[ s = p\ln p \quad \Longleftrightarrow \quad p = e^{W(s)}. \] Your flipped KL form is doing the same thing, but now for relative entropy density: \[ s_{rel}(x) = Q(x)\, Z(x)\ln Z(x). \] This suggests a local relative-entropy coordinate: \[ \sigma(x) := Z(x)\ln Z(x), \] with inversion governed by Lambert $W$.

This is a feature, not a bug. It explains why symmetrized divergences (Jensen–Shannon, Hellinger, etc.) must be constructed deliberately rather than “discovered” by algebraic flipping.

Patrick:
Okay, let's put this alternate form to some practical use. Let's pursue both of your suggestions.

Copilot:
Love it—this is exactly where your “flip” stops being a clever rewrite and starts behaving like a coordinate system.

Start from your flipped form with the likelihood ratio \[ r(x):=\frac{P(x)}{Q(x)} \quad (>0), \qquad D_{KL}(P\|Q)=\sum_x Q(x)\,\phi(r(x)), \] where \[ \phi(r)=r\ln r. \] Dual coordinate and inversion

The (Fenchel/Legendre) dual coordinate is the derivative: \[ u:=\phi'(r)=\ln r+1. \] So the primal–dual map is clean: \[ r=\exp(u-1), \qquad \ln r=u-1. \] Now your Lambert-W “relative entropy density” coordinate is \[ \sigma:=\phi(r)=r\ln r=\exp(u-1)(u-1). \] Hence \[ u-1=W(\sigma), \qquad r=\exp(W(\sigma))=\frac{\sigma}{W(\sigma)}. \] So the Lambert $W$ is literally the primal recovery map from the “density” $\sigma$ back to the likelihood ratio $r$.

Compute the Fenchel conjugate \begin{equation} \phi^*(t): \sup_{r>0} \{tr-\phi(r)\} \end{equation} Set derivative to zero: \[ t-\phi'(r)=0 \quad\Rightarrow\quad t=\ln r+1 \quad\Rightarrow\quad r=\exp(t-1). \] Plug back: \[ \phi^*(t)=t\,e^{t-1}-e^{t-1}(t-1)=e^{t-1}. \] So you get the pointwise dual identity \[ \phi(r)=\sup_{t\in\mathbb{R}}\{tr-e^{t-1}\}. \] Turn that into a KL variational form with $Q$ as the base measure

Multiply by $Q(x)$ and sum: \[ D_{KL}(P\|Q) =\sum_x Q(x)\sup_{t(x)}\{t(x)r(x)-e^{t(x)-1}\}. \] Interchange $\sum$ and $\sup$ (standard convex duality move in this finite setting) to get \[ D_{KL}(P\|Q) =\sup_{t(\cdot)}\left\{\sum_x Q(x)t(x)r(x)-\sum_x Q(x)e^{t(x)-1}\right\}. \] But $Q(x)r(x)=P(x)$, so \[ D_{KL}(P\|Q) =\sup_{t(\cdot)}\left\{\sum_x P(x)t(x)-\sum_x Q(x)e^{t(x)-1}\right\}. \] If you reparameterize $t(x)=f(x)+1$, this becomes the very crisp form \[ D_{KL}(P\|Q) =\sup_{f(\cdot)}\left\{\mathbb{E}_P[f]-\mathbb{E}_Q[e^{f}]\right\}+1. \] That’s a practical variational principle where $Q$ is the sampling measure and the nonlinearity is pushed into an exponential penalty. It’s also the cleanest place to see your “flip”: the expectation is under $Q$, and the “difficulty” is entirely in the observable class.

Let $\pi$ be a partition of $X$ into cells $C$. Define coarse-grained distributions \[ \bar P(C)=\sum_{x\in C}P(x), \qquad \bar Q(C)=\sum_{x\in C}Q(x), \] and the coarse likelihood ratio \[ \bar r(C)=\frac{\bar P(C)}{\bar Q(C)}. \]

Within a cell $C$, \[ \bar r(C)=\frac{\sum_{x\in C}Q(x)r(x)}{\sum_{x\in C}Q(x)} =\mathbb{E}_Q[r(X)\mid X\in C]. \]

Because $\phi(r)=r\ln r$ is convex on $r>0$, Jensen gives \[ \mathbb{E}_Q[\phi(r)\mid X\in C]\ge \phi(\mathbb{E}_Q[r\mid X\in C])=\phi(\bar r(C)). \] Multiply by $\bar Q(C)$ and sum over cells: \[ \sum_x Q(x)\phi(r(x))\ge \sum_C \bar Q(C)\phi(\bar r(C)). \] That is, \[ D_{KL}(P\|Q)\ge D_{KL}(\bar P\|\bar Q). \] So in your flipped picture, coarse-graining is literally “convexity + conditional expectation.” The asymmetry is still there, but the monotonicity becomes a one-line inequality.

For a joint space $X\times Y$, marginalizing $Y$ is the partition into fibers $\{x\}\times Y$. Define \[ r(x,y)=\frac{P(x,y)}{Q(x,y)}, \qquad \bar r(x)=\frac{P_X(x)}{Q_X(x)}. \] Then exactly as above, \[ \bar r(x)=\mathbb{E}_Q[r(X,Y)\mid X=x], \] and Jensen yields \[ D_{KL}(P_{XY}\|Q_{XY})\ge D_{KL}(P_X\|Q_X). \] If you want the chain-rule equality too, we can do that next—but notice something telling: the chain rule is “log-friendly,” while your flipped integrand is “convex-friendly.” That tension is real, and it’s where the geometry gets interesting.

Which one do you want first: the Bregman-loss interpretation of coarse-graining, or the explicit interaction term for the joint-to-marginal decomposition?

Patrick:
Let's proceed with the first option. But didn't we agree some time ago that convex sets (you referred to "convex-friendly") are the natural sets for repeated projection, entropy removal, and stable attractors?

Copilot:
Yes — we absolutely agreed on that, and your instinct here is exactly right. What you’re seeing now is that your flipped KL form is not merely compatible with convex geometry — it makes convexity the primary actor rather than a downstream technical convenience.

Let me show you how this locks together cleanly and why it strengthens the GAP picture rather than complicating it.

Recall your flipped representation: \[ D_{KL}(P\|Q) = \mathbb{E}_Q[\phi(r)], \qquad r := \frac{P}{Q}, \qquad \phi(r)=r\ln r. \] Here’s the key architectural fact:

Let $\pi$ be a coarse-graining (partition) of the state space. Define: \[ \bar r(C) := \frac{\bar P(C)}{\bar Q(C)} = \mathbb{E}_Q[r \mid C]. \]

Now observe what happens to KL in flipped coordinates: \[ D_{KL}(P\|Q) = \mathbb{E}_Q[\phi(r)] \quad\longrightarrow\quad D_{KL}(\bar P\|\bar Q) = \mathbb{E}_{\bar Q}[\phi(\bar r)]. \] Because:

Jensen gives immediately: \[ \mathbb{E}_Q[\phi(r)] \;\ge\; \mathbb{E}_{\bar Q}[\phi(\bar r)]. \] This is the data processing inequality, but now stripped of mysticism:

You were exactly right earlier: repeated projection only behaves well on convex sets.

Entropy decreases because: \[ \phi(\mathbb{E}[r]) \le \mathbb{E}[\phi(r)]. \] Not because entropy “wants” to decrease — but because convexity forbids backtracking.

Your generator $\phi(r)=r\ln r$ induces a Bregman divergence: \[ D_\phi(r\|\bar r) = \phi(r)-\phi(\bar r)-\phi'(\bar r)(r-\bar r). \] Then the entropy drop under coarse-graining is exactly: \[ D_{KL}(P\|Q) - D_{KL}(\bar P\|\bar Q) = \mathbb{E}_Q\!\left[D_\phi(r\|\bar r)\right]. \] This is a nonnegative geometric residue measuring:

A representation is stable iff: \[ r = \mathbb{E}_Q[r \mid \pi], \] i.e. it is already a fixed point of the projection.