I have a large set of size M (let's say 10), and I want to, repeatedly for a certain number of occasions (let's say 13), randomly split it into M/N smaller groups of size N (let's say 2). I'd like no element of the large set to be in a repeated group until it has been in a small group with everyone else. (The actual problem here: I have a class of 10 people and I want to split them into 5 pairs each week for 13 weeks, but I don't want anyone to be in a repeat pairing until they have been paired with everyone in the class.) A sketch of a standard construction that solves this appears below.

Primal and dual block coordinate descent methods are iterative methods for solving regularized and unregularized optimization problems. Distributed-memory parallel implementations of these methods have become popular in analyzing large machine learning datasets. However, existing implementations communicate at every iteration, which, on modern data center and supercomputing architectures, often dominates the cost of floating-point computation. Recent results on communication-avoiding Krylov subspace methods suggest that large speedups are possible by re-organizing iterative algorithms to avoid communication. We show how applying similar algorithmic transformations can lead to primal and dual block coordinate descent methods that only communicate every $$s$$ iterations, where $$s$$ is a tuning parameter, instead of every iteration, for the regularized least-squares problem. We show that the communication-avoiding variants reduce the number of synchronizations by a factor of $$s$$ on distributed-memory parallel machines without altering the convergence rate and attain strong scaling speedups of up to $$6.1\times$$ over the "standard algorithm" on a Cray XC30 supercomputer.

Abstract: Variants of the coordinate descent approach for minimizing a nonlinear function are distinguished in part by the order in which coordinates are considered for relaxation. Three common orderings are cyclic (CCD), in which we cycle through the components of $$x$$ in order; randomized (RCD), in which the component to update is selected randomly and independently at each iteration; and random-permutations cyclic (RPCD), which differs from CCD only in that a random permutation is applied to the variables at the start of each cycle. Known convergence guarantees are weaker for CCD and RPCD than for RCD, though in most practical cases computational performance is similar among all these variants. There is a certain type of quadratic function for which CCD is significantly slower than RCD; a recent paper by Sun & Ye (2016, Worst-case complexity of cyclic coordinate descent: $O(n^2)$ gap with randomized version. Stanford, CA: Department of Management Science and Engineering, Stanford University. arXiv:1604.07130) has explored the poor behavior of CCD on functions of this type. The RPCD approach performs well on these functions, even better than RCD in a certain regime. This paper explains the good behavior of RPCD with a tight analysis.

The quantum tests described here use fewer calls to the oracle than known classical algorithms. The linearity test combines the Bernstein-Vazirani algorithm and amplitude amplification, while the test to determine whether a function is symmetric uses projective measurements and amplitude amplification. In addition, in the case of linearity testing, if the function is linear, the quantum algorithm identifies which linear function it is.
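Here is a minimal sketch of the classic round-robin "circle method" for the pairing question, in plain Python; the name round_robin_pairs and the reshuffling policy are my own illustration, not part of the original question. For 10 people it yields 9 weeks of 5 disjoint pairs in which every possible pair appears exactly once, so a 13-week schedule can reshuffle and start a second cycle after week 9 (repeats then become unavoidable, but only after everyone has met everyone).

```python
import random

def round_robin_pairs(people):
    """Yield rounds of disjoint pairs so that every pair of distinct
    elements appears exactly once (circle method; len(people) must be even)."""
    people = list(people)
    n = len(people)
    assert n % 2 == 0, "need an even number of people"
    fixed, rest = people[0], people[1:]
    for r in range(n - 1):
        # Hold one element fixed and rotate the rest by r positions.
        circle = [fixed] + rest[r:] + rest[:r]
        # Pair the i-th element with the (n-1-i)-th element.
        yield [(circle[i], circle[n - 1 - i]) for i in range(n // 2)]

people = list(range(10))            # 10 people -> 9 rounds of 5 pairs
schedule = list(round_robin_pairs(people))
for week in range(13):              # 13 weeks: reshuffle after each full cycle
    if week > 0 and week % len(schedule) == 0:
        random.shuffle(people)
        schedule = list(round_robin_pairs(people))
    print(week + 1, schedule[week % len(schedule)])
```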
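As a rough illustration of the communication-avoiding idea in the block coordinate descent abstract above, the following NumPy sketch is my own toy, not the authors' code: the standard update needs one pass over the data matrix per iteration (an allreduce, when rows are distributed), while the $$s$$-step variant computes one small Gram matrix per $$s$$ iterations and replays the $$s$$ updates locally, giving the same iterates up to roundoff. All names (cd_step_standard, cd_s_steps_avoiding) are assumptions for illustration.

```python
import numpy as np

def cd_step_standard(A, x, r, i, lam):
    """One coordinate step for 0.5*||Ax - y||^2 + 0.5*lam*||x||^2, where
    r = Ax - y. Computing A[:, i] @ r is the per-iteration 'communication'
    when the rows of A are distributed across machines."""
    delta = -(A[:, i] @ r + lam * x[i]) / (A[:, i] @ A[:, i] + lam)
    x[i] += delta
    r += delta * A[:, i]

def cd_s_steps_avoiding(A, x, r, idx, lam):
    """s coordinate steps (idx holds s distinct indices) with ONE
    synchronization: G and g are the only quantities that would need an
    allreduce; the s updates are then replayed locally."""
    S = A[:, idx]
    G = S.T @ S                  # one communication round for s steps
    g = S.T @ r
    deltas = np.zeros(len(idx))
    for k, i in enumerate(idx):
        grad_i = g[k] + G[k, :k] @ deltas[:k]   # A[:, i] @ current residual
        deltas[k] = -(grad_i + lam * x[i]) / (G[k, k] + lam)
        x[i] += deltas[k]
    r += S @ deltas

rng = np.random.default_rng(0)
A, y, lam, s = rng.standard_normal((50, 8)), rng.standard_normal(50), 0.1, 4
x1, r1 = np.zeros(8), -y.copy()   # r = Ax - y with x = 0
x2, r2 = np.zeros(8), -y.copy()
for epoch in range(20):
    idx = rng.permutation(8)
    for i in idx:                             # 8 synchronizations per epoch
        cd_step_standard(A, x1, r1, i, lam)
    for block in idx.reshape(-1, s):          # 2 synchronizations per epoch
        cd_s_steps_avoiding(A, x2, r2, block, lam)
print(np.allclose(x1, x2))   # True: same iterates, fewer synchronizations
```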
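To make the three orderings in the coordinate descent abstract concrete, here is a small NumPy sketch (again my own illustration, not code from the paper) of exact coordinate minimization on a convex quadratic $$f(x) = \frac{1}{2}x^TAx - b^Tx$$; CCD, RCD, and RPCD differ only in how the next coordinate index is drawn.

```python
import numpy as np

def coordinate_descent(A, b, order, epochs=50, seed=0):
    """Minimize f(x) = 0.5 x^T A x - b^T x by exact coordinate minimization.
    order: 'ccd' (cyclic), 'rcd' (randomized), 'rpcd' (random-permutation)."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(epochs):
        if order == "ccd":
            idx = range(n)                      # fixed cyclic order
        elif order == "rpcd":
            idx = rng.permutation(n)            # fresh permutation each cycle
        elif order == "rcd":
            idx = rng.integers(0, n, size=n)    # i.i.d. random coordinates
        for i in idx:
            # Exact minimization over coordinate i: set (Ax - b)_i = 0.
            x[i] -= (A[i] @ x - b[i]) / A[i, i]
    return x

# Small symmetric positive-definite test problem.
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)
b = rng.standard_normal(20)
x_star = np.linalg.solve(A, b)
for order in ("ccd", "rcd", "rpcd"):
    print(order, np.linalg.norm(coordinate_descent(A, b, order) - x_star))
```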
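For the quantum fragment, a classical statevector simulation of just the Bernstein-Vazirani ingredient (my own sketch; it omits the amplitude-amplification step the abstract mentions): for a linear Boolean function $$f(x) = a \cdot x \bmod 2$$, a single oracle query reveals the hidden string $$a$$, that is, which linear function $$f$$ is.

```python
import numpy as np

def bernstein_vazirani(f, n):
    """Statevector simulation of Bernstein-Vazirani on n qubits: if
    f(x) = a.x mod 2, one oracle query recovers the hidden string a."""
    # Uniform superposition after the first layer of Hadamards.
    state = np.full(2**n, 2.0 ** -(n / 2))
    # Phase oracle: |x> -> (-1)^f(x) |x>  (the single query to f).
    for x in range(2**n):
        if f(x):
            state[x] *= -1
    # Second layer of Hadamards = Walsh-Hadamard transform.
    H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
    W = np.array([[1.0]])
    for _ in range(n):
        W = np.kron(W, H)
    amps = W @ state
    # For linear f, all amplitude concentrates on the basis state |a>.
    return int(np.argmax(np.abs(amps)))

a = 0b1011                                  # hidden string, n = 4
f = lambda x: bin(x & a).count("1") % 2     # f(x) = a.x mod 2
print(bin(bernstein_vazirani(f, 4)))        # -> 0b1011
```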