Abstract:Sparse triangular solver (SpTRSV) is an important computation kernel in scientific computing. The irregular memory access pattern of SpTRSV makes efficient data reuse difficult to achieve. Structured grid problems possess special nonzero patterns. On SW26010 processor, the major building block of Sunway Taihulight supercomputer, these patterns are often exploited during the task partitioning stage to facilitate on-chip reuse of computed unknowns; Software-based routing is usually employed to implement inter-thread communication. Routing incurs overhead and imposes certain restrictions on nonzero patterns. In this paper, we achieve on-chip data reuse without routing. The input problem is partitioned and mapped onto SW26010 such that threads with data dependencies are always connected by the register communication network. This enables direct thread communication and obviates routing. We describe our solver and test it over a variety of problems. In the experiments, our solver sustains an average memory bandwidth utilization of 88.1% with peak efficiency reaching 94.5% (24.5GB/s).