ADMM
- When the objective function is separable:
minimize f(x) + g(z), subject to Ax + Bz = c
the augmented Lagrangian is:
L_ρ(x, z, y) = f(x) + g(z) + y^T(Ax + Bz − c) + (ρ/2)||Ax + Bz − c||_2^2
solution:
x^{k+1} := argmin_x L_ρ(x, z^k, y^k)
z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)
y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} − c)
x^{k+1} is only an intermediate result: the state carried from one iteration to the next is (z^k, y^k), from which z^{k+1} and y^{k+1} are computed.
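As a concrete illustration, the three updates can be coded directly on a toy problem (a minimal sketch; the quadratics f(x) = (x−1)^2, g(z) = (z−3)^2 and the constraint x − z = 0, i.e. A = 1, B = −1, c = 0, are made up for this example):

```python
# Toy problem: minimize (x-1)^2 + (z-3)^2  subject to x - z = 0.
# The optimum is x = z = 2.  Here A = 1, B = -1, c = 0.
rho = 1.0
x = z = y = 0.0
for k in range(100):
    # x-update: argmin_x (x-1)^2 + y*(x-z) + (rho/2)*(x-z)^2
    x = (2.0 + rho * z - y) / (2.0 + rho)
    # z-update: argmin_z (z-3)^2 - y*z + (rho/2)*(x-z)^2, using the new x
    z = (6.0 + y + rho * x) / (2.0 + rho)
    # dual update: y^{k+1} = y^k + rho*(A x^{k+1} + B z^{k+1} - c)
    y = y + rho * (x - z)
print(round(x, 4), round(z, 4))  # both approach 2
```

Because both subproblems are quadratic, each argmin has a closed form; in general each update is a smaller optimization problem in one block of variables.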
Extensions of ADMM
- Extension of ADMM: multiple blocks (note: may fail to converge)
minimize Σ_{i=1}^N f_i(x_i), subject to Σ_{i=1}^N A_i x_i = c
The Gauss-Seidel update:
x_1^{k+1} := argmin_{x_1} L_ρ(x_1, x_2^k, …, x_N^k, y^k)
…
x_i^{k+1} := argmin_{x_i} L_ρ(x_1^{k+1}, …, x_{i−1}^{k+1}, x_i, x_{i+1}^k, …, x_N^k, y^k)
…
y^{k+1} := y^k + ρ(Σ_{i=1}^N A_i x_i^{k+1} − c)
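A minimal sketch of the Gauss-Seidel scheme on a made-up 3-block problem (f_i(x_i) = (x_i − a_i)^2 with scalar blocks and A_i = 1, so the constraint is x_1 + x_2 + x_3 = c; this instance is strongly convex, so the iteration happens to converge, which, as noted above, is not guaranteed in general):

```python
# Toy 3-block problem: minimize sum_i (x_i - a_i)^2  s.t.  x_1 + x_2 + x_3 = c.
# Closed-form optimum: x_i = a_i + (c - sum(a)) / 3, i.e. [2, 3, 4] here.
a = [1.0, 2.0, 3.0]
c = 9.0
rho = 0.5
x = [0.0, 0.0, 0.0]
y = 0.0
for k in range(2000):
    for i in range(3):
        s = sum(x) - x[i]  # blocks j < i already hold their new values
        # argmin_{xi} (xi - a_i)^2 + y*(s + xi - c) + (rho/2)*(s + xi - c)^2
        x[i] = (2.0 * a[i] - y + rho * (c - s)) / (2.0 + rho)
    y = y + rho * (sum(x) - c)
print([round(v, 3) for v in x])  # approaches [2, 3, 4]
```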
- Extension of ADMM: parallel distribution
The Jacobi update is possible under an extra condition (the A_i are mutually orthogonal, i.e. A_i^T A_j = 0 for i ≠ j):
x_i^{k+1} := argmin_{x_i} L_ρ(x_1^k, …, x_{i−1}^k, x_i, x_{i+1}^k, …, x_N^k, y^k)
y^{k+1} := y^k + ρ(Σ_{i=1}^N A_i x_i^{k+1} − c)
- Extension of ADMM: additional regularization
x_i^{k+1} := argmin_{x_i} L_ρ(x_1^{k+1}, …, x_{i−1}^{k+1}, x_i, x_{i+1}^k, …, x_N^k, y^k) + α||x_i − x_i^k||_2^2
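The proximal term α||x_i − x_i^k||^2 only changes each block subproblem, damping the step without moving the fixed point. A sketch on the same kind of made-up quadratic 3-block instance as above (all numbers, including α, are illustrative):

```python
# Gauss-Seidel ADMM with an extra proximal term alpha*||x_i - x_i^k||^2.
# Problem: minimize sum_i (x_i - a_i)^2  s.t.  x_1 + x_2 + x_3 = c.
a = [1.0, 2.0, 3.0]
c = 9.0
rho, alpha = 0.5, 1.0
x = [0.0, 0.0, 0.0]
y = 0.0
for k in range(5000):
    for i in range(3):
        s = sum(x) - x[i]  # x[i] itself still holds the previous value x_i^k
        # argmin (xi-a_i)^2 + y*(s+xi-c) + (rho/2)(s+xi-c)^2 + alpha*(xi - x_i^k)^2
        x[i] = (2.0 * a[i] - y + rho * (c - s) + 2.0 * alpha * x[i]) / (2.0 + rho + 2.0 * alpha)
    y = y + rho * (sum(x) - c)
print([round(v, 3) for v in x])  # approaches the same limit [2, 3, 4]
```

At a fixed point x_i = x_i^k, so the proximal term vanishes and the limit is unchanged.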
Application
- Separable objective function (Type I)
minimize f(x) + g(x)
rewrite as:
minimize f(x) + g(z), subject to x − z = 0
- Constrained convex optimization (Type II)
minimize f(x), subject to x ∈ C
rewrite as:
minimize f(x) + g(z), subject to x − z = 0
where g(z) is the indicator function of C:
g(z) = 1_C(z) = { 0, if z ∈ C; +∞, otherwise }
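With the indicator function, the z-update reduces to a Euclidean projection onto C. A minimal sketch with made-up data, taking C to be the nonnegative orthant (so the projection is a componentwise clip) and f(x) = (1/2)||x − q||^2:

```python
import numpy as np

# minimize (1/2)||x - q||^2  subject to x >= 0  (C = nonnegative orthant).
# The exact solution is max(q, 0) componentwise.
q = np.array([1.0, -2.0, 0.5])
rho = 1.0
x = np.zeros(3); z = np.zeros(3); u = np.zeros(3)  # u: scaled dual variable
for k in range(200):
    # x-update: argmin (1/2)||x - q||^2 + (rho/2)||x - z + u||^2
    x = (q + rho * (z - u)) / (1.0 + rho)
    # z-update: argmin 1_C(z) + (rho/2)||x - z + u||^2 = projection onto C
    z = np.maximum(x + u, 0.0)
    u = u + x - z
print(z)  # approaches [1, 0, 0.5]
```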
Application
- loss function + regularization
minimize l(x) + λ||x||_1
rewrite as:
minimize l(x) + λ||z||_1, subject to x − z = 0
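For the ℓ1-regularized split, the z-update is the soft-thresholding (shrinkage) operator. A minimal sketch with made-up data and l(x) = (1/2)||x − b||^2, chosen because the exact lasso solution is then soft-threshold(b, λ):

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t*||.||_1: shrink each component toward zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# minimize (1/2)||x - b||^2 + lam*||z||_1  s.t.  x - z = 0
b = np.array([3.0, -0.5, 1.0])
lam, rho = 1.0, 1.0
x = np.zeros(3); z = np.zeros(3); u = np.zeros(3)  # u: scaled dual variable
for k in range(300):
    x = (b + rho * (z - u)) / (1.0 + rho)   # quadratic x-update
    z = soft_threshold(x + u, lam / rho)    # shrinkage z-update
    u = u + x - z
print(z)  # approaches soft_threshold(b, 1) = [2, 0, 0]
```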
- minimize ||x||_1, subject to Ax − b = 0
rewrite as:
minimize ||x||_1 + 1_C(z), subject to x − z = 0, where C = {x ∈ R^n | Ax = b}
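Here the z-update is the Euclidean projection onto the affine set C = {z : Az = b}, which has the closed form v − A^T(AA^T)^{-1}(Av − b), while the x-update is soft-thresholding. A minimal sketch on a made-up underdetermined system (this tiny instance has the unique solution x = (0, 1, 0) with ℓ1 norm 1):

```python
import numpy as np

# Basis pursuit: minimize ||x||_1 s.t. Ax = b, split as ||x||_1 + 1_C(z), x - z = 0.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
pinv = A.T @ np.linalg.inv(A @ A.T)

def project_affine(v):
    # Euclidean projection onto C = {z : Az = b}
    return v - pinv @ (A @ v - b)

rho = 1.0
x = np.zeros(3); z = project_affine(np.zeros(3)); u = np.zeros(3)  # u: scaled dual
for k in range(2000):
    # x-update: argmin ||x||_1 + (rho/2)||x - z + u||^2 = soft-thresholding
    x = np.sign(z - u) * np.maximum(np.abs(z - u) - 1.0 / rho, 0.0)
    z = project_affine(x + u)  # z-update: projection onto C
    u = u + x - z
print(np.round(z, 3))
```

Note that z is exactly feasible (Az = b) after every iteration, so the constraint is satisfied throughout while the ℓ1 objective decreases.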
Another example: Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM, AAAI 2017
2-Block ADMM Convergence Analysis via Variational Inequality
What is a variational inequality?
Given a convex set Ω ⊆ R^n and a mapping F: R^n → R^n, the variational inequality VI(Ω, F) is to find x* ∈ Ω such that
(x − x*)^T F(x*) ≥ 0, ∀x ∈ Ω
For a differentiable convex problem min{Θ(x) | x ∈ Ω}, if x* is a solution,
then at the point x*,
the set of descent directions S_d(x*) = {s ∈ R^n | s^T ∇Θ(x*) < 0}
and the set of feasible directions S_f(x*) = {s ∈ R^n | s = x − x*, x ∈ Ω}
have no overlap:
(x − x*)^T ∇Θ(x*) ≥ 0, ∀x ∈ Ω
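A tiny numeric check of this geometry on a made-up instance: Θ(x) = (x − 2)^2 on Ω = [0, 1]. The constrained minimizer is the boundary point x* = 1, where the gradient is nonzero, yet the variational inequality holds over Ω:

```python
# Θ(x) = (x - 2)^2 on Ω = [0, 1]: the minimizer is x* = 1 (on the boundary),
# and ∇Θ(x*) = 2(x* - 2) = -2 is nonzero, yet the VI holds on Ω.
x_star = 1.0
grad = 2.0 * (x_star - 2.0)          # ∇Θ(x*) = -2
checks = [(x - x_star) * grad >= 0   # (x - x*)^T ∇Θ(x*) ≥ 0
          for x in [0.0, 0.25, 0.5, 0.75, 1.0]]
print(all(checks))  # True: no feasible direction is a descent direction
```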
If Θ(x) is twice differentiable, the Hessian matrix ∇²Θ(x) is symmetric.
In VI(Ω, F), if F is differentiable, the Jacobian matrix ∇F(x) is not required to be symmetric.
Convergence of 2-Block ADMM
- 2-block ADMM (equality constraint as an example)
minimize f_1(x) + f_2(z), subject to Ax + Bz = c
where
f_1(x): R^{n_1} → R,
f_2(z): R^{n_2} → R, c ∈ R^m. If we solve with ADMM, we have
x^{k+1} := argmin_x ( f_1(x) + (ρ/2)||Ax + Bz^k − c − u^k/ρ||_2^2 )
z^{k+1} := argmin_z ( f_2(z) + (ρ/2)||Ax^{k+1} + Bz − c − u^k/ρ||_2^2 )
u^{k+1} := u^k − ρ(Ax^{k+1} + Bz^{k+1} − c)
We denote v = (z, u) and w = (x, z, u). Our goal is to prove that as the iteration count k increases, v^k → v* and w^k → w*.
minimize f_1(x) + f_2(z), subject to Ax + Bz = c
Based on the min-max (saddle-point) property of the Lagrangian, the solution (x*, z*, u*) of the problem satisfies
f_1(x) − f_1(x*) + (x − x*)^T(−A^T u*) ≥ 0, ∀x
f_2(z) − f_2(z*) + (z − z*)^T(−B^T u*) ≥ 0, ∀z
(u − u*)^T(Ax* + Bz* − c) ≥ 0, ∀u
By defining
λ = (x; z),
w = (x; z; u),
G(w) = (−A^T u; −B^T u; Ax + Bz − c), and
F(λ) = f_1(x) + f_2(z),
the original task is exactly the variational inequality VI(Ω, G, F): find w* such that
F(λ) − F(λ*) + (w − w*)^T G(w*) ≥ 0, ∀w ∈ Ω
From the optimality condition of the x-update,
f_1(x) − f_1(x^{k+1}) + (x − x^{k+1})^T(−A^T u^k + ρA^T(Ax^{k+1} + Bz^k − c)) ≥ 0
holds for any x ∈ R^{n_1}; similarly, from the z-update,
f_2(z) − f_2(z^{k+1}) + (z − z^{k+1})^T(−B^T u^k + ρB^T(Ax^{k+1} + Bz^{k+1} − c)) ≥ 0
holds for any z ∈ R^{n_2}. Then substituting u^k = u^{k+1} + ρ(Ax^{k+1} + Bz^{k+1} − c) into the two inequalities, we get
f_1(x) − f_1(x^{k+1}) + (x − x^{k+1})^T(−A^T u^k + ρA^T(Ax^{k+1} + Bz^k − c))
  = f_1(x) − f_1(x^{k+1}) + (x − x^{k+1})^T(−A^T(u^{k+1} + ρ(Ax^{k+1} + Bz^{k+1} − c)) + ρA^T(Ax^{k+1} + Bz^k − c))
  = f_1(x) − f_1(x^{k+1}) + (x − x^{k+1})^T(−A^T u^{k+1} + ρA^T B(z^k − z^{k+1})) ≥ 0
holds for any x ∈ R^{n_1}, and
f_2(z) − f_2(z^{k+1}) + (z − z^{k+1})^T(−B^T u^{k+1}) ≥ 0
holds for any z ∈ R^{n_2}.
Next, we add the two inequalities together and get:
f_1(x) − f_1(x^{k+1}) + (x − x^{k+1})^T(−A^T u^{k+1} + ρA^T B(z^k − z^{k+1})) + f_2(z) − f_2(z^{k+1}) + (z − z^{k+1})^T(−B^T u^{k+1}) ≥ 0
Denote λ = (x; z) and F(λ) = f_1(x) + f_2(z); the above inequality can be rewritten in a more compact form (here (·; ·) stacks blocks into a column):
F(λ) − F(λ^{k+1}) + (x − x^{k+1}; z − z^{k+1})^T { (−A^T u^{k+1}; −B^T u^{k+1}) + ρ(A^T; B^T) B(z^k − z^{k+1}) + diag(0, ρB^T B)(x^{k+1} − x^k; z^{k+1} − z^k) } ≥ 0, ∀λ ∈ R^{n_1} × R^{n_2}
We additionally add the dual variable u to the inequality to get
F(λ) − F(λ^{k+1}) + (x − x^{k+1}; z − z^{k+1}; u − u^{k+1})^T { (−A^T u^{k+1}; −B^T u^{k+1}; Ax^{k+1} + Bz^{k+1} − c) + ρ(A^T; B^T; 0) B(z^k − z^{k+1}) + diag(0, ρB^T B, (1/ρ)I_m)(x^{k+1} − x^k; z^{k+1} − z^k; u^{k+1} − u^k) } ≥ 0
(the new u-row contributes nothing, since Ax^{k+1} + Bz^{k+1} − c + (1/ρ)(u^{k+1} − u^k) = 0 by the dual update).
The last inequality holds for all w = (x, z, u); thus we can set w = w* and get
F(λ*) − F(λ^{k+1}) + (w* − w^{k+1})^T { (−A^T u^{k+1}; −B^T u^{k+1}; Ax^{k+1} + Bz^{k+1} − c) + ρ(A^T; B^T; 0) B(z^k − z^{k+1}) + diag(0, ρB^T B, (1/ρ)I_m)(x^{k+1} − x^k; z^{k+1} − z^k; u^{k+1} − u^k) } ≥ 0
For simplicity, write P(z^k, z^{k+1}) = ρ(A^T; B^T; 0) B(z^k − z^{k+1}) and H_0 = diag(0, ρB^T B, (1/ρ)I_m); then
F(λ*) − F(λ^{k+1}) + (w* − w^{k+1})^T { G(w^{k+1}) + P(z^k, z^{k+1}) + H_0(v^{k+1} − v^k) } ≥ 0
(since the first diagonal block of H_0 is zero, H_0 acts only on the v = (z, u) part of w^{k+1} − w^k).
Then, moving several terms from left to right, we get:
(w* − w^{k+1})^T H_0(v^{k+1} − v^k) ≥ (w^{k+1} − w*)^T P(z^k, z^{k+1}) + F(λ^{k+1}) − F(λ*) + (w^{k+1} − w*)^T G(w^{k+1})
As the first block row of H_0 is all zero, we can rewrite the left side with H = diag(ρB^T B, (1/ρ)I_m) (the last two diagonal blocks of H_0):
(v^{k+1} − v*)^T H(v^k − v^{k+1}) ≥ (w^{k+1} − w*)^T P(z^k, z^{k+1}) + F(λ^{k+1}) − F(λ*) + (w^{k+1} − w*)^T G(w^{k+1})
On the other hand, for the last two terms on the right side, we have:
{F(λ^{k+1}) − F(λ*)} + (w^{k+1} − w*)^T G(w^{k+1}) ≥ {F(λ^{k+1}) − F(λ*)} + (w^{k+1} − w*)^T G(w*) ≥ 0
The former inequality holds because G is monotone (its linear part is skew-symmetric, so (w^{k+1} − w*)^T(G(w^{k+1}) − G(w*)) = 0), and the latter is exactly the VI characterization of the solution, applied at w = w^{k+1}.
It follows that
(v^{k+1} − v*)^T H(v^k − v^{k+1}) ≥ (w^{k+1} − w*)^T P(z^k, z^{k+1})
For the right side, we have
(w^{k+1} − w*)^T P(z^k, z^{k+1})
  = (w^{k+1} − w*)^T ρ(A^T; B^T; 0) B(z^k − z^{k+1})
  = ρ(B(z^k − z^{k+1}))^T (A, B, 0)(w^{k+1} − w*)
  = ρ(B(z^k − z^{k+1}))^T ((Ax^{k+1} + Bz^{k+1}) − (Ax* + Bz*))
  = ρ(B(z^k − z^{k+1}))^T (Ax^{k+1} + Bz^{k+1} − c)        (since Ax* + Bz* = c)
  = (B(z^k − z^{k+1}))^T (u^k − u^{k+1})                    (by the dual update)
  = (u^k − u^{k+1})^T B(z^k − z^{k+1})
Recall that
f_2(z) − f_2(z^{k+1}) + (z − z^{k+1})^T(−B^T u^{k+1}) ≥ 0
holds for any z ∈ R^{n_2}. From the previous iteration it likewise holds that
f_2(z) − f_2(z^k) + (z − z^k)^T(−B^T u^k) ≥ 0
for any z ∈ R^{n_2}. Setting z = z^k in the former inequality and z = z^{k+1} in the latter,
and adding the two inequalities, we get:
(z^k − z^{k+1})^T(−B^T u^{k+1}) + (z^{k+1} − z^k)^T(−B^T u^k) = (u^k − u^{k+1})^T B(z^k − z^{k+1}) ≥ 0
This ensures
(w^{k+1} − w*)^T P(z^k, z^{k+1}) = (u^k − u^{k+1})^T B(z^k − z^{k+1}) ≥ 0
Overall, we have
(v^{k+1} − v*)^T H(v^k − v^{k+1}) ≥ (w^{k+1} − w*)^T P(z^k, z^{k+1}) ≥ 0
Having established
(v^{k+1} − v*)^T H(v^k − v^{k+1}) ≥ 0
It is easy to prove that
||v^k − v*||_H^2 = ||(v^{k+1} − v*) + (v^k − v^{k+1})||_H^2
  = ||v^{k+1} − v*||_H^2 + 2(v^{k+1} − v*)^T H(v^k − v^{k+1}) + ||v^k − v^{k+1}||_H^2
  ≥ ||v^{k+1} − v*||_H^2 + ||v^k − v^{k+1}||_H^2
namely,
||v^{k+1} − v*||_H^2 ≤ ||v^k − v*||_H^2 − ||v^k − v^{k+1}||_H^2
The sequence ||v^k − v*||_H^2 is therefore monotonically decreasing, which guarantees convergence. It can further be shown that ADMM converges at rate O(1/t).
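The contraction above can be checked numerically on a toy instance (a sketch with made-up quadratics: minimize (x−1)^2 + (z−3)^2 subject to x − z = 0, so A = 1, B = −1, c = 0 and H = diag(ρB^T B, (1/ρ)I) = diag(ρ, 1/ρ); the dual update follows the sign convention u^{k+1} = u^k − ρ(Ax^{k+1} + Bz^{k+1} − c) used in the proof):

```python
# Verify ||v^{k+1} - v*||_H^2 <= ||v^k - v*||_H^2 - ||v^k - v^{k+1}||_H^2
# on: minimize (x-1)^2 + (z-3)^2  s.t.  x - z = 0  (A = 1, B = -1, c = 0).
rho = 1.0
H = (rho, 1.0 / rho)                      # H = diag(rho*B^T B, (1/rho) I)

def norm_H_sq(v):
    return H[0] * v[0] ** 2 + H[1] * v[1] ** 2

x = z = u = 0.0
vs = []
for k in range(200):
    # x-update: argmin (x-1)^2 - u*(x - z) + (rho/2)*(x - z)^2
    x = (2.0 + u + rho * z) / (2.0 + rho)
    # z-update: argmin (z-3)^2 + u*z + (rho/2)*(x - z)^2
    z = (6.0 - u + rho * x) / (2.0 + rho)
    u = u - rho * (x - z)                 # dual update from the proof
    vs.append((z, u))
v_star = vs[-1]                           # treat the last iterate as v*
gaps = [norm_H_sq((v[0] - v_star[0], v[1] - v_star[1])) for v in vs]
monotone = all(gaps[k + 1] <= gaps[k] + 1e-12 for k in range(len(gaps) - 1))
print(monotone)  # True: ||v^k - v*||_H^2 decreases monotonically
```

Here v* is approximated by the final iterate, which is accurate to machine precision for this linearly convergent toy problem.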