{"title": "Riemannian approach to batch normalization", "book": "Advances in Neural Information Processing Systems", "page_first": 5225, "page_last": 5235, "abstract": "Batch normalization (BN) has proven to be an effective algorithm for deep neural network training by normalizing the input to each neuron and reducing the internal covariate shift. The space of weight vectors in the BN layer can be naturally interpreted as a Riemannian manifold, which is invariant to linear scaling of weights. Following the intrinsic geometry of this manifold provides a new learning rule that is more efficient and easier to analyze. We also propose intuitive and effective gradient clipping and regularization methods for the proposed algorithm by utilizing the geometry of the manifold. The resulting algorithm consistently outperforms the original BN on various types of network architectures and datasets.", "full_text": "Riemannian approach to batch normalization\n\nMinhyung Cho\n\nJaehyung Lee\n\nApplied Research Korea, Gracenote Inc.\n\nmhyung.cho@gmail.com\n\njaehyung.lee@kaist.ac.kr\n\nAbstract\n\nBatch Normalization (BN) has proven to be an effective algorithm for deep neural\nnetwork training by normalizing the input to each neuron and reducing the internal\ncovariate shift. The space of weight vectors in the BN layer can be naturally\ninterpreted as a Riemannian manifold, which is invariant to linear scaling of\nweights. Following the intrinsic geometry of this manifold provides a new learning\nrule that is more ef\ufb01cient and easier to analyze. We also propose intuitive and\neffective gradient clipping and regularization methods for the proposed algorithm\nby utilizing the geometry of the manifold. The resulting algorithm consistently\noutperforms the original BN on various types of network architectures and datasets.\n\n1\n\nIntroduction\n\nBatch Normalization (BN) [1] has become an essential component for breaking performance records\nin image recognition tasks [2, 3]. 
It speeds up training deep neural networks by normalizing the\ndistribution of the input to each neuron in the network by the mean and standard deviation of the\ninput computed over a mini-batch of training data, potentially reducing internal covariate shift [1],\nthe change in the distributions of internal nodes of a deep network during the training.\nThe authors of BN demonstrated that applying BN to a layer makes its forward pass invariant to\nlinear scaling of its weight parameters [1]. They argued that this property prevents model explosion\nwith higher learning rates by making the gradient propagation invariant to linear scaling. Moreover,\nthe gradient becomes inversely proportional to the scale factor of each weight parameter. While this\nproperty could stabilize the parameter growth by reducing the gradients for larger weights, it could\nalso have an adverse effect in terms of optimization since there can be an in\ufb01nite number of networks,\nwith the same forward pass but different scaling, which may converge to different local optima owing\nto different gradients. In practice, networks may become sensitive to the parameters of regularization\nmethods such as weight decay.\nThis ambiguity in the optimization process can be removed by interpreting the space of weight\nvectors as a Riemannian manifold on which all the scaled versions of a weight vector correspond\nto a single point on the manifold. A properly selected metric tensor makes it possible to perform\na gradient descent on this manifold [4, 5], following the gradient direction while staying on the\nmanifold. This approach fundamentally removes the aforementioned ambiguity while keeping the\ninvariance property intact, thus ensuring stable weight updates.\nIn this paper, we \ufb01rst focus on selecting a proper manifold along with the corresponding Riemannian\nmetric for the scale invariant weight vectors used in BN (and potentially in other normalization\ntechniques [6, 7, 8]). 
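The scale-invariance and inverse-scaling properties discussed above are easy to verify numerically; a minimal NumPy sketch (batch statistics only, no learnable scale and offset, using a smooth scale-invariant toy loss of our own choosing for the gradient check):

```python
import numpy as np

def bn(z):
    # Batch-normalize pre-activations by the batch mean and standard deviation.
    return (z - z.mean()) / z.std()

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))   # mini-batch of activations
w = rng.normal(size=16)          # weight vector of one unit

# The forward pass is invariant to linear scaling of the weights.
assert np.allclose(bn(x @ w), bn(x @ (7.0 * w)))

# The gradient w.r.t. w is inversely proportional to the scale factor:
# check with central finite differences on a smooth scale-invariant loss.
def num_grad(w, eps=1e-6):
    loss = lambda v: np.sum(np.tanh(bn(x @ v)))
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w); d[i] = eps
        g[i] = (loss(w + d) - loss(w - d)) / (2 * eps)
    return g

assert np.allclose(num_grad(7.0 * w), num_grad(w) / 7.0, rtol=1e-4, atol=1e-6)
```

Because any loss computed from the BN output is unchanged by rescaling w, the gradient relation holds for every such loss, not just the toy one used here.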
Mapping scale invariant weight vectors to two well-known matrix manifolds\nyields the same metric tensor, leading to a natural choice of the manifold and metric. Then, we derive\nthe necessary operators to perform a gradient descent on this manifold, which can be understood\nas a constrained optimization on the unit sphere. Next, we present two optimization algorithms,\ncorresponding to the Stochastic Gradient Descent (SGD) with momentum and Adam [9] algorithms.\nAn intuitive gradient clipping method is also proposed utilizing the geometry of this space. Finally,\nwe illustrate the application of these algorithms to networks with BN layers, together with an effective\nregularization method based on variational inference on the manifold. Experiments show that the\nresulting algorithm consistently outperforms the original BN algorithm on various types of network\narchitectures and datasets.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Background\n\n2.1 Batch normalization\n\nWe briefly revisit the BN transform and its properties. While it can be applied to any single activation\nin the network, in practice it is usually inserted right before the nonlinearity, taking the pre-activation\nz = w^T x as its input. In this case, the BN transform is written as\n\nBN(z) = (z - E[z]) / sqrt(Var[z]) = w^T (x - E[x]) / sqrt(w^T R_xx w) = u^T (x - E[x]) / sqrt(u^T R_xx u)   (1)\n\nwhere w is a weight vector, x is a vector of activations in the previous layer, u = w/|w|, and R_xx\nis the covariance matrix of x. Note that BN(w^T x) = BN(u^T x). It was shown in [1] that\n\n∂BN(w^T x)/∂x = ∂BN(u^T x)/∂x   and   ∂BN(z)/∂w = (1/|w|) ∂BN(z)/∂u   (2)\n\nillustrating the properties discussed in Sec. 
1.\n\n2.2 Optimization on Riemannian manifold\n\nRecent studies have shown that various constrained optimization problems in Euclidian space can be\nexpressed as unconstrained optimization problems on submanifolds embedded in Euclidian space [5].\nFor applications to neural networks, we are interested in Stiefel and Grassmann manifolds [4, 10].\nWe brie\ufb02y review them here. The Stiefel manifold V(p, n) is the set of p ordered orthonormal vectors\nin Rn(p \u2264 n). A point on the manifold is represented by an n-by-p orthonormal matrix Y , where\nY (cid:62)Y = Ip. The Grassmann manifold G(p, n) is the set of p-dimensional subspaces of Rn(p \u2264 n).\nIt follows that span(A), where A \u2208 Rn\u00d7p, is understood to be a point on the Grassmann manifold\nG(p, n) (note that two matrices A and B are equivalent if and only if span(A) = span(B)). A point\non this manifold can be speci\ufb01ed by an arbitrary n-by-p matrix, but for computational ef\ufb01ciency,\nan orthonormal matrix is commonly chosen to represent a point. Note that the representation is not\nunique [5].\nTo perform gradient descent on those manifolds, it is essential to equip them with a Riemannian metric\ntensor and derive geometric concepts such as geodesics, exponential map, and parallel translation.\nGiven a tangent vector v \u2208 TxM on a Riemannian manifold M with its tangent space TxM at a\npoint x, let us denote \u03b3v(t) as a unique geodesic on M, with initial velocity v. The exponential map\nis de\ufb01ned as expx(v) = \u03b3v(1), which maps v to the point that is reached in a unit time along the\ngeodesic starting at x. The parallel translation of a tangent vector on a Riemannian manifold can\nbe obtained by transporting the vector along the geodesic by an in\ufb01nitesimally small amount, and\nremoving the vertical component of the tangent space [11]. 
In this way, the transported vector stays\nin the tangent space of the manifold at a new point.\nUsing the concepts above, a gradient descent algorithm for an abstract Riemannian manifold is given\nin Algorithm 1 for reference. This reduces to the familiar gradient descent algorithm when M = Rn,\nsince expyt\u22121(\u2212\u03b7 \u00b7 h) is given as yt\u22121 \u2212 \u03b7 \u00b7 \u2207f (yt\u22121) in Rn.\nAlgorithm 1 Gradient descent of a function f on an abstract Riemannian manifold M\nRequire: Stepsize \u03b7\nInitialize y0 \u2208 M\nfor t = 1,\u00b7\u00b7\u00b7 , T\n\nh \u2190 gradf (yt\u22121) \u2208 Tyt\u22121M where gradf (y) is the gradient of f at y \u2208 M\nyt \u2190 expyt\u22121 (\u2212\u03b7 \u00b7 h)\n\n2\n\n\f3 Geometry of scale invariant vectors\n\nAs discussed in Sec. 2.1, inserting the BN transform makes the weight vectors w, used to calculate\nthe pre-activation w(cid:62)x, invariant to linear scaling. Assuming that there are no additional constraints\non the weight vectors, we can focus on the manifolds on which the scaled versions of a vector collapse\nto a point. A natural choice for this would be the Grassmann manifold since the space of the scaled\nversions of a vector is essentially a one-dimensional subspace of Rn. On the other hand, the Stiefel\nmanifold can also represent the same space if we set p = 1, in which case V(1, n) reduces to the unit\nsphere. We can map each of the weight vectors w to its normalized version, i.e., w/|w|, on V(1, n).\nWe show that popular choices of metrics on those manifolds lead to the same geometry.\nTangent vectors to the Stiefel manifold V(p, n) at Z are all the n-by-p matrices \u2206 such that Z(cid:62)\u2206 +\n\u2206(cid:62)Z = 0 [4]. 
The canonical metric on the Stiefel manifold is derived based on the geometry of\nquotient spaces of the orthogonal group [4] and is given by\n\ng_s(Δ1, Δ2) = tr(Δ1^T (I − ZZ^T/2) Δ2)   (3)\n\nwhere Δ1, Δ2 are tangent vectors to V(p, n) at Z. If p = 1, the condition Z^T Δ + Δ^T Z = 0 is\nreduced to Z^T Δ = 0, leading to g_s(Δ1, Δ2) = tr(Δ1^T Δ2).\nNow, let an n-by-p matrix Y be a representation of a point on the Grassmann manifold G(p, n).\nTangent vectors to the manifold at span(Y) with the representation Y are all the n-by-p matrices\nΔ such that Y^T Δ = 0. Since Y is not a unique representation, the tangent vector Δ changes with\nthe choice of Y. For example, given a representation Y1 and its tangent vector Δ1, if a different\nrepresentation is selected by performing right multiplication, i.e., Y2 = Y1 R, then the tangent vector\nmust be moved in the same way, that is Δ2 = Δ1 R. The canonical metric, which is invariant under\nthe action of the orthogonal group and scaling [10], is given by\n\ng_g(Δ1, Δ2) = tr((Y^T Y)^{-1} Δ1^T Δ2)   (4)\n\nwhere Y^T Δ1 = 0 and Y^T Δ2 = 0. For G(1, n) with a representation y, the metric is given by\ng_g(Δ1, Δ2) = Δ1^T Δ2 / y^T y. The metric is invariant to the scaling of y as shown below\n\nΔ1^T Δ2 / y^T y = (kΔ1)^T (kΔ2) / (ky)^T (ky).   (5)\n\nWithout loss of generality, we can choose a representation with y^T y = 1 to obtain g_g(Δ1, Δ2) =\ntr(Δ1^T Δ2), which coincides with the canonical metric for V(1, n). 
Hereafter, we will focus on the\ngeometry of G(1, n) with the metric and representation chosen above, derived from the general\nformula in [4, 10].\nGradient of a function   The gradient of a function f(y) defined on G(1, n) is given by\n\ngradf = g − (y^T g)y   (6)\n\nwhere g_i = ∂f/∂y_i.\nExponential map   Let h be a tangent vector to G(1, n) at y. The exponential map on G(1, n)\nemanating from y with initial velocity h is given by\n\nexp_y(h) = y cos|h| + (h/|h|) sin|h|.   (7)\n\nIt can be easily shown that exp_y(h) = exp_y((1 + 2π/|h|)h).\nParallel translation   Let Δ and h be tangent vectors to G(1, n) at y. The parallel translation of\nΔ along the geodesic with the initial velocity h in a unit time is given by\n\npt_y(Δ; h) = Δ − (u(1 − cos|h|) + y sin|h|) u^T Δ,   (8)\n\nwhere u = h/|h|. Note that |Δ| = |pt_y(Δ; h)|. If Δ = h, it can be further simplified as\n\npt_y(h) = h cos|h| − y|h| sin|h|.   (9)\n\nNote that BN(z) is not invariant to scaling with negative numbers. That is, BN(−z) = −BN(z). To\nbe precise, there is a one-to-one mapping between the set of weights on which BN(z) is invariant\nand a point on V(1, n), but not on G(1, n). However, the proposed method interprets each weight\nvector as a point on the manifold only when the weight update is performed. As long as the weight\nvector stays in the domain where V(1, n) and G(1, n) have the same invariance property, the weight\nupdate remains equivalent. We prefer G(1, n) since the operators can easily be extended to G(p, n),\nopening up further applications.\n\n(a) Gradient   (b) Exponential map   (c) Parallel translation\n\nFigure 1: An illustration of the operators on the Grassmann manifold G(1, 2). A 2-by-1 matrix y\nis an orthonormal representation on G(1, 2). (a) A gradient calculated in Euclidean coordinate is\nprojected onto the tangent space TyG(1, 2). 
(b) y1 = expy(h). (c) h1 = pty(h), |(cid:126)h| = |(cid:126)h1|.\n\n4 Optimization algorithms on G(1, n)\nIn this section, we derive optimization algorithms on the Grassmann manifold G(1, n). The algorithms\ngiven below are iterative algorithms to solve the following unconstrained optimization:\n\nmin\n\ny\u2208G(1,n)\n\nf (y).\n\n(10)\n\n4.1 Stochastic gradient descent with momentum\nThe application of Algorithm 1 to the Grassmann manifold G(1, n) is straightforward. We extend\nthis algorithm to the one with momentum to speed up the training [12]. Algorithm 2 presents the\npseudo-code of the SGD with momentum on G(1, n). This algorithm differs from conventional\nSGD in three ways. First, it projects the gradient onto the tangent space at the point y, as shown in\nFig. 1 (a). Second, it moves the position by the exponential map in Fig. 1 (b). Third, it moves the\nmomentum by the parallel translation of the Grassmann manifold in Fig. 1 (c). Note that if the weight\nis initialized with a unit vector, it remains a unit vector after the update.\nAlgorithm 2 has an advantage over conventional SGD in that the amount of movement is intuitive,\ni.e., it can be measured by the angle between the original point and the new point. As it returns\nto the original point after moving by 2\u03c0 (radian), it is natural to restrict the maximum movement\ninduced by a gradient to 2\u03c0. For \ufb01rst order methods like gradient descent, it would be bene\ufb01cial to\nrestrict the maximum movement even more so that it stays in the range where linear approximation\nThe overall contribution of h is(cid:80)\u221e\nis valid. Let h be the gradient calculated at t = 0. The amount of the \ufb01rst step by the gradient\nof h is \u03b40 = \u03b7 \u00b7 |h| and the contributions to later steps are recursively calculated by \u03b4t = \u03b3 \u00b7 \u03b4t\u22121.\nt=0 \u03b4t = \u03b7 \u00b7 |h|/(1 \u2212 \u03b3). 
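The geometric-series bound just derived can be checked numerically; a small sketch, using the step size, momentum coefficient, and clipped gradient norm recommended in the text:

```python
# One gradient h contributes eta * |h| at its own step, and gamma times the
# previous contribution at every later step: delta_t = gamma * delta_{t-1}.
eta, gamma = 0.2, 0.9
h_norm = 0.1                      # |h| after clipping at nu = 0.1
contributions = [eta * h_norm * gamma**t for t in range(1000)]
total = sum(contributions)

# The series sums to eta * |h| / (1 - gamma) = 0.2 (rad).
assert abs(total - eta * h_norm / (1 - gamma)) < 1e-9
assert total <= 0.2 + 1e-9
```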
In practice, we found it bene\ufb01cial\nto restrict this amount to less than 0.2 (rad) \u223c= 11.46\u25e6 by clipping the norm of h at \u03bd. For ex-\nample, with initial learning rate \u03b7 = 0.2, setting \u03b3 = 0.9 and \u03bd = 0.1 guarantees this condition.\nAlgorithm 2 Stochastic gradient descent with momentum on G(1, n)\nRequire: learning rate \u03b7, momentum coef\ufb01cient \u03b3, norm_threshold \u03bd\nInitialize y0 \u2208 Rn\u00d71 with a random unit vector\nInitialize \u03c40 \u2208 Rn\u00d71 with a zero vector\nfor t = 1,\u00b7\u00b7\u00b7 , T\ng \u2190 \u2202f (yt\u22121)/\u2202y\nRun a backward pass to obtain g\nh \u2190 g \u2212 (y(cid:62)\nProject g onto the tangent space at yt\u22121\n\u02c6h \u2190 norm_clip(h, \u03bd)\u2020 Clip the norm of the gradient at \u03bd\nd \u2190 \u03b3\u03c4t\u22121 \u2212 \u03b7\u02c6h\nyt \u2190 expyt\u22121\n(d)\n\u03c4t \u2190 ptyt\u22121\nNote that h, \u02c6h, d \u22a5 yt\u22121 and \u03c4t \u22a5 yt where h, \u02c6h, d, yt\u22121, yt \u2208 Rn\u00d71\n\nUpdate delta with momentum\nMove to the new position by the exponential map in Eq. (7)\nMove the momentum by the parallel translation in Eq. (9)\n\n\u2020norm_clip(h, \u03bd) = \u03bd \u00b7 h/|h| if |h| > \u03bd, else h\n\nt\u22121g)yt\u22121\n\n(d)\n\n4\n\n\f4.2 Adam\n\nAdam [9] is a recently developed \ufb01rst-order optimization algorithm based on adaptive estimates\nof lower-order moments that has been successfully applied to training deep neural networks. In\nthis section, we derive Adam on the Grassmann manifold G(1, n). Adam computes the individual\nadaptive learning rate for each parameter. In contrast, we assign one adaptive learning rate to each\nweight vector that corresponds to a point on the manifold. In this way, the direction of the gradient is\nnot corrupted, and the size of the step is adaptively controlled. 
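One update of Algorithm 2 can be transcribed directly into NumPy; a minimal sketch, assuming the Euclidean gradient g of the loss with respect to the unit-norm weight y has already been computed by backpropagation:

```python
import numpy as np

def norm_clip(h, nu):
    # Clip the norm of the tangent gradient at nu.
    n = np.linalg.norm(h)
    return h * (nu / n) if n > nu else h

def sgd_g_step(y, tau, g, eta=0.2, gamma=0.9, nu=0.1):
    """One step of SGD with momentum on G(1, n) (Algorithm 2)."""
    h = g - (y @ g) * y                  # project g onto the tangent space at y
    h = norm_clip(h, nu)
    d = gamma * tau - eta * h            # momentum update (a tangent vector at y)
    a = np.linalg.norm(d)
    if a == 0.0:
        return y, d
    y_new = y * np.cos(a) + (d / a) * np.sin(a)    # exponential map, Eq. (7)
    tau_new = d * np.cos(a) - y * a * np.sin(a)    # parallel translation, Eq. (9)
    return y_new, tau_new

rng = np.random.default_rng(1)
y = rng.normal(size=16); y /= np.linalg.norm(y)
tau = np.zeros_like(y)
for _ in range(10):
    y, tau = sgd_g_step(y, tau, rng.normal(size=16))

assert np.isclose(np.linalg.norm(y), 1.0)   # the weight stays a unit vector
assert abs(y @ tau) < 1e-9                  # momentum stays in the tangent space
```

The two assertions reflect the invariants noted in Algorithm 2: the weight remains a point on the unit sphere, and the translated momentum remains tangent at the new point.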
The pseudo-code of Adam on G(1, n)\nis presented in Algorithm 3.\nIt was shown in [9] that the effective step size of Adam (|d| in Algorithm 3) has two upper bounds.\nThe \ufb01rst occurs in the most severe case of sparsity, and the upper bound is given as \u03b7 1\u2212\u03b21\u221a\nsince the\n1\u2212\u03b22\nprevious momentum terms are negligible. The second case occurs if the gradient remains stationary\nacross time steps, and the upper bound is given as \u03b7. For the common selection of hyperparameters\n\u03b21 = 0.9, \u03b22 = 0.99, two upper bounds coincide. In our experiments, \u03b7 was chosen to be 0.05 and\nthe upper bound was |d| \u2264 0.05 (rad).\nAlgorithm 3 Adam on G(1, n)\nRequire: learning rate \u03b7, momentum coef\ufb01cients \u03b21, \u03b22, norm_threshold \u03bd, scalar \u0001 = 10\u22128\nInitialize y0 \u2208 Rn\u00d71 with a random unit vector\nInitialize \u03c40 \u2208 Rn\u00d71 with a zero vector\nInitialize a scalar v0 = 0\nfor t = 1,\u00b7\u00b7\u00b7 , T\n\n\u03b7t \u2190 \u03b7(cid:112)1 \u2212 \u03b2t\n\n2/(1 \u2212 \u03b2t\n1)\n\nCalculate the bias correction factor\nRun a backward pass to obtain g\nProject g onto the tangent space at yt\u22121\nClip the norm of the gradient at \u03bd\n\ng \u2190 \u2202f (yt\u22121)/\u2202y\nh \u2190 g \u2212 (y(cid:62)\nt\u22121g)yt\u22121\n\u02c6h \u2190 norm_clip(h, \u03bd)\nmt \u2190 \u03b21 \u00b7 \u03c4t\u22121 + (1 \u2212 \u03b21) \u00b7 \u02c6h\nvt \u2190 \u03b22 \u00b7 vt\u22121 + (1 \u2212 \u03b22) \u00b7 \u02c6h(cid:62)\u02c6h\n\u221a\nd \u2190 \u2212\u03b7t \u00b7 mt/\nyt \u2190 expyt\u22121 (d)\n\u03c4t \u2190 ptyt\u22121\n(mt; d)\nNote that h, \u02c6h, mt, d \u22a5 yt\u22121 and \u03c4t \u22a5 yt where h, \u02c6h, mt, d, \u03c4t, yt\u22121, yt \u2208 Rn\u00d71\n\n(vt is a scalar)\nCalculate delta\nMove to the new point by exponential map in Eq. (7)\nMove the momentum by parallel translation in Eq. 
(8)\n\nvt + \u0001\n\n5 Batch normalization on the product manifold of G(1,\u00b7)\nIn Sec. 3, we have shown that a weight vector used to compute the pre-activation that serves as an\ninput to the BN transform can be naturally interpreted as a point on G(1, n). For deep networks\nwith multiple layers and multiple units per layer, there can be multiple weight vectors that the BN\ntransform is applied to. In this case, the training of neural networks is converted into an optimization\nproblem with respect to a set of points on Grassmann manifolds and the remaining set of parameters.\nIt is formalized as\n\nminX\u2208ML(X ) where M = G(1, n1) \u00d7 \u00b7\u00b7\u00b7 \u00d7 G(1, nm) \u00d7 Rl\n\n(11)\n\nwhere n1 . . . nm are the dimensions of weight vectors, m is the number of the weight vectors on\nG(1,\u00b7) which will be optimized using Algorithm 2 or 3, and l is the number of remaining parameters\nwhich include biases, learnable scaling and offset parameters in BN layers, and other weight matrices.\nAlgorithm 4 presents the whole process of training deep neural networks. The forward pass and\nbackward pass remain unchanged. The only change made is updating the weights by Algorithm 2\nor Algorithm 3. Note that we apply the proposed algorithm only when the input layer to BN is\nunder-complete, that is, the number of output units is smaller than the number of input units, because\nthe regularization algorithm we will derive in Sec. 5.1 is only valid in this case. There should be ways\nto expand the regularization to over-complete layers. 
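The layer-selection rule just described (a column of a BN input weight matrix W is treated as a point on G(1, n) only when W is under-complete, i.e., n > p) can be sketched as follows; the helper name and list-based bookkeeping are illustrative, not part of the paper:

```python
import numpy as np

def partition_weights(weight_matrices):
    # Columns of under-complete (n > p) BN input weight matrices become
    # unit-norm representations of points on G(1, n); all other matrices
    # are left as ordinary Euclidean parameters.
    grassmann, euclidean = [], []
    for W in weight_matrices:
        n, p = W.shape
        if n > p:
            for i in range(p):
                w = W[:, i]
                grassmann.append(w / np.linalg.norm(w))
        else:
            euclidean.append(W)
    return grassmann, euclidean

rng = np.random.default_rng(3)
g, e = partition_weights([rng.normal(size=(64, 32)),   # under-complete: on G(1, 64)
                          rng.normal(size=(16, 32))])  # over-complete: stays Euclidean
assert len(g) == 32 and len(e) == 1
assert all(np.isclose(np.linalg.norm(v), 1.0) for v in g)
```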
However, we do not elaborate on this topic\nsince 1) the ratio of over-complete layers is very low (under 0.07% for wide resnets and under 5.5%\nfor VGG networks) and 2) we believe that over-complete layers are suboptimal in neural networks,\nwhich should be avoided by proper selection of network architectures.\n\n5\n\n\fAlgorithm 4 Batch normalization on product manifolds of G(1,\u00b7)\nDe\ufb01ne the neural network model with BN layers\nm \u2190 0\nfor W = {weight matrices in the network such that W (cid:62)x is an input to a BN layer}\n\nLet W be an n \u00d7 p matrix\nif n > p\n\nfor i = 1,\u00b7\u00b7\u00b7 , p\nm \u2190 m + 1\nAssign a column vector wi in W to ym \u2208 G(1, n)\n\nAssign remaining parameters to v \u2208 Rl\nminL(y1,\u00b7\u00b7\u00b7 , ym, v)\u2020 w.r.t yi \u2208 G(1, ni) for i = 1,\u00b7\u00b7\u00b7 , m and v \u2208 Rl\nfor t = 1,\u00b7\u00b7\u00b7 , T\n\nRun a forward pass to calculate L\nRun a backward pass to obtain \u2202L\nfor i = 1,\u00b7\u00b7\u00b7 , m\n\n\u2202yi\n\nfor i = 1,\u00b7\u00b7\u00b7 , m and \u2202L\n\n\u2202v\n\nUpdate the point yi by Algorithm 2 or Algorithm 3\n\n\u2020 For orthogonality regularization as in Sec. 5.1, L is replaced with L +(cid:80)\n\nUpdate v by conventional optimization algorithms (such as SGD)\n\nW LO(\u03b1, W )\n\n5.1 Regularization using variational inference\n\nIn conventional neural networks, L2 regularization is normally adopted to regularize the networks.\nHowever, it does not work on Grassmann manifolds because the gradient vector of the L2 regulariza-\ntion is perpendicular to the tangent space of the Grassmann manifold. In [13], the L2 regularization\nwas derived based on the Gaussian prior and delta posterior in the framework of variational inference.\nWe extend this theory to Grassmann manifolds in order to derive a proper regularization method in\nthis space.\nConsider the complexity loss, which accounts for the cost of describing the network weights. 
It is\ngiven by the Kullback-Leibler divergence between the posterior distribution Q(w|\u03b2) and the prior\ndistribution P (w|\u03b1) [13]:\n\nLC(\u03b1, \u03b2) = DKL(Q(w|\u03b2) (cid:107) P (w|\u03b1)).\n\n(12)\n\nFactor analysis (FA) [14] establishes the link between the Grassmann manifold and the space of\nprobabilistic distributions [15]. The factor analyzer is given by\n\np(x) = N (u, C),\n\nC = ZZ(cid:62) + \u03c32I\n\n(13)\nwhere Z is a full-rank n-by-p matrix (n > p) and N denotes a normal distribution. Under the\ncondition that u = 0 and \u03c3 \u2192 0, the samples from the analyzer lie in the linear subspace span(Z). In\nthis way, a linear subspace can be considered as an FA distribution.\nSuppose that n-dimensional p weight vectors y1,\u00b7\u00b7\u00b7 , yp for n > p are in the same layer, which are\nassumed as p points on G(1, n). Let yi be a representation of a point such that y(cid:62)\ni yi = 1. With the\nchoice of delta posterior and \u03b2 = [y1,\u00b7\u00b7\u00b7 , yp], the corresponding FA distribution can be given by\nq(x|Y ) = N (0, Y Y (cid:62) + \u03c32I), where Y = [y1,\u00b7\u00b7\u00b7 , yp] with the subspace condition \u03c3 \u2192 0. The\nFA distribution for the prior is set to p(x|\u03b1) = N (0, \u03b1I) that depends on the hyperparameter \u03b1.\nSubstituting the FA distribution of the prior and posterior into Eq. (12) gives the complexity loss\n\n(cid:0)q(x|Y ) (cid:107) p(x|\u03b1)(cid:1).\n\nLC(\u03b1, Y ) = DKL\n\n(14)\n\nEq. (14) is minimized when the column vectors of Y are orthogonal to each other (refer to Appendix A\nfor details). That is, minimizing LC(\u03b1, Y ) will maximally scatter the points away from each other\non G(1, n). However, it is dif\ufb01cult to estimate its gradient. Alternatively, we minimize\n\n(15)\nwhere (cid:107) \u00b7 (cid:107)F is the Frobenius norm. 
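The surrogate L_O(α, Y) = (α/2)‖YᵀY − I‖²_F is cheap to evaluate; the sketch below also uses its standard Euclidean matrix gradient 2αY(YᵀY − I), which is our own derivation and is not spelled out in the paper:

```python
import numpy as np

def ortho_reg(Y, alpha):
    # L_O(alpha, Y) = (alpha / 2) * ||Y^T Y - I||_F^2, with Euclidean gradient
    # 2 * alpha * Y @ (Y^T Y - I) (standard matrix-calculus identity).
    M = Y.T @ Y - np.eye(Y.shape[1])
    return 0.5 * alpha * np.sum(M * M), 2.0 * alpha * (Y @ M)

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.normal(size=(32, 8)))   # matrix with orthonormal columns
loss_q, grad_q = ortho_reg(Q, alpha=0.1)
loss_r, _ = ortho_reg(rng.normal(size=(32, 8)), alpha=0.1)

assert np.isclose(loss_q, 0.0) and np.allclose(grad_q, 0.0)  # minimized at orthonormal Y
assert loss_r > 0.0                                          # penalizes non-orthogonal columns
```

As the assertions show, the loss (and its gradient) vanish exactly when the column vectors are orthonormal, matching the minimum of the original complexity loss.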
It has the same minimum as the original complexity loss and the\nnegative of its gradient is a descent direction of the original loss (refer to Appendix B).\n\nLO(\u03b1, Y ) =\n\nF\n\n(cid:107) Y (cid:62)Y \u2212 I (cid:107)2\n\n\u03b1\n2\n\n6\n\n\f6 Experiments\n\nWe evaluated the proposed learning algorithm for image classi\ufb01cation tasks using three benchmark\ndatasets: CIFAR-10 [16], CIFAR-100 [16], and SVHN (Street View House Number) [17]. We used\nthe VGG network [18] and wide residual network [2, 19, 20] for experiments. The VGG network\nis a widely used baseline for image classi\ufb01cation tasks, while the wide residual network [2] has\nshown state-of-the-art performance on the benchmark datasets. We followed the experimental setups\ndescribed in [2] so that the performance of algorithms can be directly compared. Source code is\npublicly available at https://github.com/MinhyungCho/riemannian-batch-normalization.\nCIFAR-10 is a database of 60,000 color images in 10 classes, which consists of 50,000 training\nimages and 10,000 test images. CIFAR-100 is similar to CIFAR-10, except that it has 100 classes\nand contains fewer images per class. For preprocessing, we normalized the data using the mean\nand variance calculated from the training set. During training, the images were randomly \ufb02ipped\nhorizontally, padded by four pixels on each side with the re\ufb02ection, and a 32\u00d732 crop was randomly\nsampled. SVHN [17] is a digit classi\ufb01cation benchmark dataset that contains 73,257 images in the\ntraining set, 26,032 images in the test set, and 531,131 images in the extra set. We merged the extra\nset and the training set in our experiment, following the step in [2]. The only preprocessing done was\nto divide the intensity by 255.\nDetailed architectures for various VGG networks are described in [18]. We used 512 neurons in\nfully connected layers rather than 4096 neurons, and the BN layer was placed before every ReLU\nactivation layer. 
The learnable scaling parameter in the BN layer was set to one because it does not\nreduce the expressive power of the ReLU layer [21]. For SVHN experiments using VGG networks,\nthe dropout was applied after the pooling layer with dropout rate 0.4. For wide residual networks,\nwe adopted exactly the same model architectures in [2], including the BN and dropout layers. In all\ncases, the biases were removed except the \ufb01nal layer.\nFor the baseline, the networks were trained by SGD with Nesterov momentum [22]. The weight\ndecay was set to 0.0005, momentum to 0.9, and minibatch size to 128. For CIFAR experiments, the\ninitial learning rate was set to 0.1 and multiplied by 0.2 at 60, 120, and 160 epochs. It was trained for\na total of 200 epochs. For SVHN, the initial learning rate was set to 0.01 and multiplied by 0.1 at 60\nand 120 epochs. It was trained for a total of 160 epochs.\nFor the proposed method, we used different learning rates for the weights in Euclidean space and\non Grassmann manifolds. Let us denote the learning rates for Euclidean space and Grassmann\nmanifolds as \u03b7e and \u03b7g, respectively. The selected initial learning rates were \u03b7e = 0.01, \u03b7g = 0.2 for\nAlgorithm 2 and \u03b7e = 0.01, \u03b7g = 0.05 for Algorithm 3. The same initial learning rates were used\nfor all CIFAR experiments. For SVHN, they were scaled by 1/10, following the same ratio as the\nbaseline [2]. The training algorithm for Euclidean parameters was identical to the one used in the\nbaseline with one exception. We did not apply weight decay to scaling and offset parameters of BN,\nwhereas the baseline did as in [2]. To clarify, applying weight decay to mean and variance parameters\nof BN was essential for reproducing the performance of baseline. The learning rate schedule was\nalso identical to the baseline, both for \u03b7e and \u03b7g. The threshold for clipping the gradient \u03bd was set to\n0.1. The regularization strength \u03b1 in Eq. 
(15) was set to 0.1, which gradually achieved near zero LO\nduring the course of the training, as shown in Fig. 2.\n\nFigure 2: Changes in LO in\nEq. (15) during training for vari-\nous \u03b1 values (y-axis on the left).\nThe red dotted line denotes the\nlearning rate (\u03b7g, y-axis on the\nright). VGG-11 was trained by\nSGD-G on CIFAR-10.\n\n6.1 Results\n\nTables 1 and 2 compare the performance of the baseline SGD and two proposed algorithms described\nin Sec. 4 and 5, on CIFAR-10, CIFAR-100, and SVHN datasets. All the numbers reported are the\nmedian of \ufb01ve independent runs. In most cases, the networks trained using the proposed algorithms\n\n7\n\n\f(a) CIFAR-10\n\n(b) CIFAR-100\n\n(c) SVHN\n\nFigure 3: Training curves of the baseline and proposed optimization methods. (a) WRN-28-10 on\nCIFAR-10. (b) WRN-28-10 on CIFAR-100. (c) WRN-22-8 on SVHN.\n\noutperformed the baseline across various datasets and network con\ufb01gurations, especially for the ones\nwith more parameters. The best performance was 3.72% (SGD and SGD-G) and 17.85% (ADAM-G)\non CIFAR-10 and CIFAR-100, respectively, with WRN-40-10; and 1.55% (ADAM-G) on SVHN\nwith WRN-22-8.\nTraining curves of the baseline and proposed methods are presented in Figure 3. The training curves\nfor SGD suffer from instability or experience a plateau after each learning rate drop, compared to\nthe proposed methods. We believe that this comes from the inverse proportionality of the gradient to\nthe norm of BN weight parameters (as in Eq. (2)). During the training process, this norm is affected\nby weight decay, hence the magnitude of the gradient. It is effectively equivalent to disturbing the\nlearning rate by weight decay. The authors of wide resnet also observed that applying weight decay\ncaused this phenomena, but weight decay was indispensable for achieving the reported performance\n[2]. 
Proposed methods resolve this issue in a principled way.\nTable 3 summarizes the performance of recently published algorithms on the same datasets. We\npresent the best performance of \ufb01ve independent runs in this table.\n\nTable 1: Classi\ufb01cation error rate of various networks on CIFAR-10 and CIFAR-100 (median of \ufb01ve\nruns). VGG-l denotes a VGG network with l layers. WRN-d-k denotes a wide residual network that\nhas d convolutional layers and a widening factor k. SGD-G and Adam-G denote Algorithm 2 and\nAlgorithm 3, respectively. The results in parenthesis show those reported in [2].\nCIFAR-100\n\nCIFAR-10\n\nDataset\nModel\nVGG-11\nVGG-13\nVGG-16\nVGG-19\nWRN-52-1\nWRN-16-4\nWRN-28-10\nWRN-40-10\u2020\n\u2020This model was trained on two GPUs. The gradients were summed from two minibatches of size 64, and\nBN statistics were calculated from each minibatch.\n\nSGD-G\n28.02\n25.29\n25.64\n25.79\n28.13\n24.51\n18.19\n18.04\n\n27.44 (29.78)\n23.41 (23.91)\n18.66 (18.85)\n18.39 (18.3)\n\nAdam-G\n\n28.05\n24.89\n25.29\n25.59\n28.16\n24.24\n18.30\n17.85\n\n7.59\n6.05\n5.98\n6.02\n6.58\n5.28\n3.78\n3.80\n\nSGD\n29.25\n26.17\n26.84\n27.62\n\nSGD-G Adam-G\n\nSGD\n7.43\n5.88\n6.32\n6.49\n\n6.23 (6.28)\n4.96 (5.24)\n3.89 (3.89)\n3.72 (3.8)\n\n7.14\n5.87\n5.88\n5.92\n6.56\n5.35\n3.85\n3.72\n\n7 Conclusion and discussion\nWe presented new optimization algorithms for scale-invariant vectors by representing them on G(1, n)\nand following the intrinsic geometry. Speci\ufb01cally, we derived SGD with momentum and Adam\nalgorithms on G(1, n). 
An ef\ufb01cient regularization algorithm in this space has also been proposed.\nApplying them in the context of BN showed consistent performance improvements over the baseline\nBN algorithm with SGD on CIFAR-10, CIFAR-100, and SVHN datasets.\n\n8\n\n\fTable 2: Classi\ufb01cation error rate of various networks on SVHN (median of \ufb01ve runs).\n\nSGD-G Adam-G\n\nModel\nVGG-11\nVGG-13\nVGG-16\nVGG-19\nWRN-52-1\nWRN-16-4\nWRN-16-8\nWRN-22-8\n\nSGD\n2.11\n1.78\n1.85\n1.94\n\n1.68 (1.70)\n1.64 (1.64)\n1.60 (1.54)\n\n1.64\n\n2.14\n1.72\n1.76\n1.77\n1.67\n1.61\n1.68\n1.55\n\n2.10\n1.74\n1.76\n1.81\n1.72\n1.67\n1.69\n1.63\n\n7.47\n6.55\n6.37\n6.05\n4.91\n4.62\n3.8\n3.491\n\nTable 3: Performance comparison with previously published results.\nSVHN\n1.88\n\nCIFAR-10 CIFAR-100\n\nNormProp [7]\n\nMethod\n\nELU [23]\n\nScalable Bayesian optimization [24]\n\nGeneralizing pooling [25]\n\nStochastic depth [26]\n\nResNet-1001 [20]\n\nWide residual network [2]\nProposed (best of \ufb01ve runs)\n\n29.24\n24.28\n27.4\n\n-\n\n24.98\n22.71\n18.3\n17.592\n\n-\n-\n\n-\n\n1.69\n1.75\n\n1.54\n1.493\n\n1WRN-40-10+SGD-G 2WRN-40-10+Adam-G 3WRN-22-8+Adam-G\n\nOur work interprets each scale invariant piece of the weight matrix as a separate manifold, whereas\nnatural gradient based algorithms [27, 28, 29] interpret the whole parameter space as a manifold and\nconstrain the shape of the cost function (i.e. to the KL divergence) to obtain a cost ef\ufb01cient metric.\nThere are similar approaches to ours such as Path-SGD [30] and the one based on symmetry-invariant\nupdates [31], but the comparison remains to be done.\nProposed algorithms are computationally as ef\ufb01cient as their non-manifold versions since they do not\naffect the forward and backward propagation steps, where majority of the computation takes place.\nThe weight update step is 2.5-3.5 times more expensive, but still O(n).\nWe did not explore the full range of possibilities offered by the proposed algorithm. 
For example, techniques similar to BN, such as weight normalization [6] and normalization propagation [7], have scale-invariant weight vectors and can benefit from the proposed algorithm in the same way. Layer normalization [8] is invariant to weight matrix rescaling, and simple vectorization of the weight matrix enables the application of the proposed algorithm.

References

[1] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.
[2] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[3] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[4] Alan Edelman, Tomás A Arias, and Steven T Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[5] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
[6] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29, pages 901–909, 2016.
[7] Devansh Arpit, Yingbo Zhou, Bhargava Kota, and Venu Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In Proceedings of The 33rd International Conference on Machine Learning, pages 1168–1176, 2016.
[8] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[9] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematicae, 80(2):199–220, 2004.
[11] M. P. do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, 1976.
[12] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[13] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pages 2348–2356, 2011.
[14] Zoubin Ghahramani, Geoffrey E Hinton, et al. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996.
[15] Jihun Hamm and Daniel D Lee. Extended Grassmann kernels for subspace-based learning. In Advances in Neural Information Processing Systems 21, pages 601–608, 2009.
[16] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[17] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[21] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv preprint arXiv:1702.03275, 2017.
[22] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147, 2013.
[23] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[24] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180, 2015.
[25] Chen-Yu Lee, Patrick W Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In International Conference on Artificial Intelligence and Statistics, 2016.
[26] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[27] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[28] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In International Conference on Learning Representations, 2014.
[29] James Martens and Roger B Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of The 32nd International Conference on Machine Learning, pages 2408–2417, 2015.
[30] Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.
[31] Vijay Badrinarayanan, Bamdev Mishra, and Roberto Cipolla. Understanding symmetries in deep networks. arXiv preprint arXiv:1511.01029, 2015.