UPC/UPC++ for parallel programming and computing
UPC++: a PGAS library for high performance computing
UPC++ is a library that implements the Asynchronous PGAS model. We are revising the library under the auspices of the DOE's Exascale Computing Project to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and for handling memories with different optimal access methods are composable and closely resemble those used in conventional C++.
The key abstractions in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; asynchronous programming via RPC, also known as function shipping; and futures. Futures enable the programmer to capture the readiness state of data, which is useful for making scheduling decisions or for chaining together a DAG of operations that execute asynchronously as high-latency dependencies become satisfied.
The UPC++ programmer can expect communication to run at close to hardware speeds. To this end, UPC++ runs atop the GASNet communication library and takes advantage of GASNet's low-overhead communication as well as access to any special hardware support, e.g. RDMA.
Most current programming models are based on the premise that computing is the most expensive component. For example, at the application/algorithm level, the time complexity of an algorithm is analyzed by counting the number of arithmetic operations while ignoring the cost of data movement. At the system level, prominent scheduling mechanisms such as work-stealing aim at keeping all the computing units (i.e., cores) active and ignore the additional cost of data movement caused by several active cores sharing caches.
However, we are entering a big-data era where computing is cheap and massively parallel while data movement dominates performance and energy costs. In order to utilize the next generation of high performance computing systems (i.e., exascale systems), programming models need a paradigm shift from compute-centric to data-centric.
In this talk, I will discuss the possibility of incorporating the data-centric aspect into the partitioned global address space (PGAS) programming paradigm, particularly UPC++. I will present our preliminary results on data-centric UPC++.
Light scattering by nanoparticles: can parallel computing give us better solar cells?
Solar cells are one of the best green energy sources, as they convert sunlight directly into electricity. Although their price is constantly falling, production costs remain high. To reduce cost, researchers are aiming for thinner solar cells, but less material means reduced efficiency. To keep efficiency high and production costs low, researchers are adding nanostructures on top of the thin solar cells. But what arrangement of nanostructures gives the highest efficiency? To investigate this, we need the tools of theoretical physics: with computer experiments we can understand how light propagates around and inside the nanostructures. In our FRINATEK project we run computer simulations that study the scattering of light by different nanostructures. These simulations are done on high-performance computers. To simulate ever more realistic structures, we need to increase the simulation grid, but today this is limited in memory and computational time because our codes run on single nodes. In this talk I will investigate whether parallel computing can overcome the memory and computation-time limitations of our codes.
Performance optimisation and modelling of UPC code that involves fine-grained communication
UPC, one of the most widely used PGAS languages, has several user-friendly features in its design. These features relieve programmers of burdens such as working explicitly with inter-thread data movement. However, they may also bring performance penalties, especially for programs that incur fine-grained communication between threads. We show that it is important to avoid global shared array pointers, replacing them with private copies and aggregated inter-thread data exchanges. Moreover, we present our work on modelling the performance of several UPC implementations, with the aim of shedding light on why and when UPC's shared array pointers become prohibitively time-consuming to use.