Please use this identifier to cite or link to this item: `https://hdl.handle.net/1822/82753`

DC Field / Value / Language
dc.contributor.author: Silva, Filipe Pereira da (por)
dc.date.accessioned: 2023-02-17T11:02:34Z
dc.date.available: 2023-02-17T11:02:34Z
dc.date.issued: 2021-10-27
dc.date.submitted: 2021-11
dc.identifier.uri: https://hdl.handle.net/1822/82753
dc.description.abstract: The 2D convection-diffusion equation is a well-known problem in scientific simulation, often solved with a direct method for a system of N linear equations, which requires O(N³) operations. A more efficient computational method is the alternating direction implicit (ADI) method: each iteration is split into two half-steps, one sweeping row by row and the other column by column, so that 2N tridiagonal systems of N equations each are solved instead of one large system. Within each half-step the N systems are fully independent, which opens the door to an embarrassingly parallel solution. Splitting each iteration in two also exploits the way matrices are stored in computer memory, either row-major or column-major. The major bottleneck of the method is solving these systems of linear equations. They can be described as tridiagonal matrices, since the non-zero elements always lie on the three central diagonals. Algorithms tailored for tridiagonal matrices can significantly improve performance; these can be sequential (e.g. the Thomas algorithm) or parallel (e.g. cyclic reduction, CR, and parallel cyclic reduction, PCR). Current vector extensions in conventional scalar processing units, such as x86-64 and ARM devices, require the vector elements to be in contiguous memory locations to avoid performance penalties. To overcome these limitations in dot products, several approaches are proposed and evaluated in this work, both on general-purpose processing units and on specific accelerators, namely NVIDIA GPUs. Profiling the code execution on a server based on x86-64 devices showed that the ADI method needs a combination of CPU compute power and memory transfer speed. This is best shown on a server based on the Intel many-core device, KNL, where the algorithm scales until the memory bandwidth is no longer enough to feed all 64 computing cores.
A dual-socket server based on 16-core Xeon Skylake processors, with AVX-512 vector support, proved to be the better choice: the algorithm executes in less time and scales better. Introducing GPU computing to further improve execution performance (together with other optimisation techniques, namely a different thread scheme and shared memory) yielded better results for larger grid sizes (above 32Ki x 32Ki). The CUDA development environment also performed better than OpenCL in most cases. The largest difference was with a hybrid CR-PCR method, where the OpenCL code showed a major performance improvement over CUDA. Even with this speedup, however, the best average time for the ADI method across all tested configurations on an NVIDIA GPU was obtained using CUDA on the most recent GPU available (Pascal architecture), with CR as the auxiliary method. (por)
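The O(N) tridiagonal solve at the heart of each ADI half-step can be illustrated with the Thomas algorithm mentioned in the abstract. A minimal sketch in Python follows (illustrative only; the thesis itself targets vectorised CPU and CUDA/OpenCL implementations, and the function name and argument layout here are assumptions, not the thesis's code):

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system Ax = d in O(N).

    a: sub-diagonal   (a[0] is unused)
    b: main diagonal
    c: super-diagonal (c[-1] is unused)
    d: right-hand side
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination: remove the sub-diagonal.
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

In an ADI half-step, one such independent solve runs per row (or per column), which is the embarrassingly parallel structure the abstract refers to; the data dependency inside each forward/backward sweep is why parallel alternatives such as CR and PCR matter on GPUs.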
dc.language.iso: eng (por)
dc.rights: openAccess (por)
dc.subject: Master thesis (por)
dc.subject: GPU Computing (por)
dc.subject: Physics (por)
dc.subject: HPC (por)
dc.subject: Mathematics (por)
dc.subject: Computação em GPU (por)
dc.subject: Física (por)
dc.subject: Matemática (por)
dc.title: Parallelization of the ADI method exploring vector computing in GPUs (por)
dc.type: masterThesis (eng)
dc.identifier.tid: 203150538 (por)