CUDA profiling and optimization
The goal of this project is to implement a parallel prefix scan and a parallel bitonic sort in CUDA, and in doing so to build the CUDA skills the tasks require. The program should be able to sort and scan any input file of up to 2^20 (1,048,576) rows. The implementation is done in stages: first perfecting the scan, then perfecting the sort, then adapting the sort to sort on a key from a larger structure, and finally putting it all together in one program along with the code for reading the input CSV file and producing the output CSV file.
The stages should be completed strictly in this order.
Scan: The parallel prefix scan uses the parallel dissemination technique. The reference implementation was extended to two tiers of data so that it can handle up to 1024*1024 (2^20) elements. The top tier gathers the last entry of each leaf block and is then scanned itself, which yields the prefix that each leaf block needs. An additional kernel adds these prefixes to the elements of each leaf block. The approach works for sizes that are not powers of two, and it has been extended further to support several tier-two blocks, with a third tier that ties those tier-two blocks together.
Sort: The parallel bitonic sort is built on bitonic merging.
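Looking back at the scan stage described above, a minimal two-tier sketch might look like the following. It is not the project's actual code: the names scanBlocks, addPrefixes, and inclusiveScan, the block size of 1024, and the int element type are all assumptions made for illustration. The sketch stops at two tiers, so it is limited to 1024*1024 elements, and it omits the third tier mentioned above.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 1024;  // assumed block size: one element per thread

// Inclusive dissemination scan of one leaf block in shared memory.
// If blockSums is non-null, the block's last entry (its total) is saved there.
__global__ void scanBlocks(int *data, int *blockSums, int n) {
    __shared__ int tmp[BLOCK];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tmp[tid] = (gid < n) ? data[gid] : 0;  // pad a partial block with zeros
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int add = (tid >= stride) ? tmp[tid - stride] : 0;
        __syncthreads();  // every read finishes before any write
        tmp[tid] += add;
        __syncthreads();  // every write finishes before the next round
    }
    if (gid < n) data[gid] = tmp[tid];
    if (blockSums != nullptr && tid == blockDim.x - 1)
        blockSums[blockIdx.x] = tmp[tid];
}

// Adds the scanned total of all preceding blocks to each element of a block.
__global__ void addPrefixes(int *data, const int *scannedSums, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x > 0 && gid < n) data[gid] += scannedSums[blockIdx.x - 1];
}

// Host wrapper (hypothetical): two-tier inclusive scan, at most BLOCK * BLOCK elements.
std::vector<int> inclusiveScan(const std::vector<int> &in) {
    int n = static_cast<int>(in.size());
    if (n == 0) return {};
    int blocks = (n + BLOCK - 1) / BLOCK;
    assert(blocks <= BLOCK);  // a third tier would be needed beyond this size
    int *dData = nullptr, *dSums = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dSums, blocks * sizeof(int));
    cudaMemcpy(dData, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    scanBlocks<<<blocks, BLOCK>>>(dData, dSums, n);    // tier 1: leaf blocks
    scanBlocks<<<1, BLOCK>>>(dSums, nullptr, blocks);  // tier 2: block totals
    addPrefixes<<<blocks, BLOCK>>>(dData, dSums, n);   // push prefixes down
    std::vector<int> out(n);
    cudaMemcpy(out.data(), dData, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dSums);
    return out;
}
```

Under these assumptions, calling inclusiveScan on the values 3, 1, 4, 1, 5 should return 3, 4, 8, 9, 14. Padding partial blocks with zeros is what lets the same kernels handle sizes that are not powers of two.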
The outer loops of the reference implementation were changed so that they work with larger j and k values. For smaller j values, the j loop stays on the device and its steps are synchronized with __syncthreads(); the j loop returns to the host only when j is too large for the compared elements to fit within a single block. The outer k loop is contained entirely within the block.
Adapt to sort a key from a larger structure: The sort was modified to take its key from a larger structure. All of the X and Y data structures are first copied onto the device and are then sorted in descending order by the x-value key.
Putting it all together: All of the components were integrated into a single program, along with the code that reads the input CSV file and writes the output CSV file.
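The sort stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the Point struct with float x and y fields, the kernel names, the 1024-thread block size, and the padding with negative infinity for sizes that are not powers of two are all hypothetical choices. For simplicity, the sketch keeps the k loop on the host and works directly in global memory, without the shared-memory and unrolling optimizations.

```cuda
#include <algorithm>
#include <cassert>
#include <cuda_runtime.h>
#include <limits>
#include <vector>

struct Point { float x; float y; };  // hypothetical row: key x, payload y

constexpr unsigned THREADS = 1024;   // assumed block size

// One compare-exchange of a bitonic network that sorts descending by x.
__device__ void compareExchange(Point *d, unsigned i, unsigned j, unsigned k) {
    unsigned partner = i ^ j;
    if (partner > i) {
        bool descending = ((i & k) == 0);
        bool doSwap = descending ? (d[i].x < d[partner].x)
                                 : (d[i].x > d[partner].x);
        if (doSwap) { Point t = d[i]; d[i] = d[partner]; d[partner] = t; }
    }
}

// Large j: partners can sit in different blocks, so one step per launch.
__global__ void bitonicStepGlobal(Point *d, unsigned j, unsigned k) {
    compareExchange(d, blockIdx.x * blockDim.x + threadIdx.x, j, k);
}

// Small j (j < blockDim.x): partners share a block, so all remaining
// steps of stage k run on the device, separated by __syncthreads().
__global__ void bitonicStepsLocal(Point *d, unsigned jStart, unsigned k) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned j = jStart; j > 0; j >>= 1) {
        compareExchange(d, i, j, k);
        __syncthreads();
    }
}

// Host wrapper (hypothetical): sorts rows in descending order of x.
std::vector<Point> sortByXDescending(const std::vector<Point> &rows) {
    if (rows.empty()) return {};
    assert(rows.size() <= (1u << 20));  // row limit stated in the report
    unsigned n = 1;
    while (n < rows.size()) n <<= 1;    // pad to the next power of two
    std::vector<Point> padded(rows);
    // Padding keys sort to the end; assumes no real key is -infinity.
    padded.resize(n, Point{-std::numeric_limits<float>::infinity(), 0.0f});
    Point *d = nullptr;
    cudaMalloc(&d, n * sizeof(Point));
    cudaMemcpy(d, padded.data(), n * sizeof(Point), cudaMemcpyHostToDevice);
    unsigned threads = std::min(n, THREADS);
    unsigned blocks = n / threads;
    for (unsigned k = 2; k <= n; k <<= 1) {
        unsigned j = k >> 1;
        for (; j >= threads; j >>= 1)   // too large for one block
            bitonicStepGlobal<<<blocks, threads>>>(d, j, k);
        if (j > 0)                      // the rest stays on the device
            bitonicStepsLocal<<<blocks, threads>>>(d, j, k);
    }
    cudaMemcpy(padded.data(), d, n * sizeof(Point), cudaMemcpyDeviceToHost);
    cudaFree(d);
    padded.resize(rows.size());         // drop the padding rows
    return padded;
}
```

With rows whose x values are 1, 3, and 2, the returned rows should have x values 3, 2, 1, each still paired with its original y. Because j and the block size are both powers of two, any j smaller than the block size pairs elements within the same block, which is why block-level synchronization is enough for those steps.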
The application reliably processes input files of any size up to 2^20 rows.
Profiling: The program was profiled with NVIDIA's profiling tool, nvprof, to identify bottlenecks and guide optimization. The results showed that most of the time was spent in the bitonic sort kernel. The execution time of that kernel was reduced by using shared memory and loop unrolling.
Conclusion: The CUDA implementation of the parallel prefix scan and the parallel bitonic sort handles input files of up to 2^20 rows. The program was implemented incrementally, starting with the scan, then the sort, then adapting the sort to use a key from a larger structure, and finally putting it all together in one program. It was then profiled with nvprof and optimized to reduce execution time.