IBM Research

GREPS 2007 Workshop


Call for Papers and Participation

GREPS: International Workshop on GCC for Research in Embedded and Parallel Systems
in conjunction with
16th International Conference on Parallel Architectures and Compilation Techniques (PACT)

Brasov, Romania
September 16, 2007

Keynote: GCC in Software Performance Research: Just Plug In

Paul H J Kelly

GCC presents a unique opportunity to link the research community with industrial delivery of compiler technology. This talk reports our experience with several projects that illustrate the importance and value of GCC for researchers, and shows the different ways we have engaged with the GCC community. GCC is a commercial-quality compiler, and thus a route for delivering new compiler technology from the research lab into the hands of users. It was not developed as an infrastructure for research, but it offers a unique framework: it is a full implementation of several major languages, it includes increasingly sophisticated analyses and optimizations, and it operates under the constraints of an industrial-scale development effort. This talk reviews several of our GCC-based projects, including field-sensitive pointer analysis, bounds checking, and run-time code generation. We reflect on the opportunity, and the research challenges, in developing GCC to make good on this promise.

Keynote: GCC for Embedded VLIW Processors: Why Not?

Benoit Dupont de Dinechin

Embedded VLIW processors such as the Texas Instruments C6000 VLIW-DSP family and the STMicroelectronics ST200 VLIW-media family are increasingly used in consumer, mobile and automotive electronics. Unlike the previous generation of DSPs, these high-performance processors are as compiler friendly as a modern VLIW can be. However, the production compilers for these processors are not GCC-based yet.

Based on the STMicroelectronics experience in developing production code generators for the ST120 VLIW-DSP and the ST200 VLIW-media processors, we discuss the reasons the current GCC compilers are not seen as competitive with the proprietary compilers in the case of embedded VLIW processors. This contrasts with the case of RISC-like embedded processors such as the ST40 (SH4) or the ARM family members, where STMicroelectronics contributes to the corresponding GCC ports.

Besides understanding user-level annotations and performing some tailored loop transformations, the primary contribution to the superior performance of proprietary compilers comes from their code generators. Code generation for embedded VLIW processors requires that partially or fully predicated instruction sets be effectively exploited. Another requirement of advanced code generation is support for the SSA form on machine code. Finally, near-optimal instruction scheduling and software pipelining techniques based on integer programming make a difference.

Gcov on an Embedded System

Holger Blasum, Frank Görgen, and Jürgen Urban

We describe how gcov (based on GCC 3.4.4) was used in an embedded system (PowerPC architecture). Unlike the gcov kernel analysis of the Linux Test Project, our modifications do not use any file system access during data collection. We also show how a one-line modification can lead to a conservative estimate of coverage in a multithreaded setting.

Improving a Selective Scheduling Approach for GCC

Andrey Belevantsev, Dmitry Melnik, and Arutyun Avetisyan

For the last two years, we have been working on a project implementing an aggressive interblock scheduler for GCC, using the selective scheduling approach as a basis. The scheduler supports a number of instruction transformations, software pipelining, and IA-64 speculation capabilities. The implementation is available on the sel-sched branch in the GCC repository.

While working on the project, we have found a number of deficiencies in the original approach that either hurt the resulting performance of the scheduler or complicate the implementation in a production compiler. This paper discusses the problems found and suggests solutions that either have already proved useful in GCC or are going to be implemented there. We concentrate on problems with the heuristics for choosing instructions in selective scheduling and on compile-time issues.

Interprocedural Analysis of Aggregates

Martin Jambor

Currently, the main way of making the contents of aggregate members available to other optimizations is scalar replacement, which requires that the aggregate be allocated on the stack and that its address not be shared among different functions. The latter condition can sometimes be fulfilled by inlining, but the number of aggregates that must live in memory remains large even with aggressive inlining. This paper discusses the design and implementation of an interprocedural analysis of aggregates that identifies aggregates used in a simple enough way that they have no aliases. Moreover, the analysis determines exactly how references to such aggregates are passed between functions. On top of this analysis, we have also proposed interprocedural propagation of constants within aggregates to demonstrate its usage. In the experiments we carried out, the analysis detected that about 20% of aggregates, and of pointers and references to them, were used simply enough to be reasoned about easily.

Incremental Machine Descriptions for GCC

Sameera Deshpande and Uday P. Khedker

The mechanism of providing machine descriptions to the GCC framework has been quite successful, as demonstrated by the wide variety of targets for which a GCC port exists. However, the mechanism is rather ad hoc, and machine descriptions are difficult to construct, understand, maintain, and enhance because of their verbosity, level of detail, and repetitiveness. The publicly available material fails to bring out the exact abstractions captured by machine descriptions; there is no systematic way of constructing them, and there are no clear guidelines on where to begin developing a machine description or how to build one up systematically.

This paper proposes a methodology based on incremental construction of a machine description, starting from a well-defined minimal machine description. We illustrate the process by constructing machine descriptions for the spim simulator for the MIPS architecture.

An Approach for Data Propagation from Tree SSA to RTL

Dmitry Melnik, Sergey Gaissaryan, Alexander Monakov, and Dmitry Zhurikhin

This paper describes an approach for propagating information collected during the Tree SSA optimization passes down to the RTL level, enabling RTL passes to take advantage of information available on Tree SSA, which is generally more precise than that available on RTL. At the moment, the propagated information includes data-dependence and may-alias information. We propose an approach for such propagation and discuss the problems that arise during its implementation.

What you can do with Structure in Compiler

Olga Golovanevsky

This paper presents structure layout reorganization optimizations in GCC.
Starting from basic optimizations, such as structure reordering, peeling, and splitting, they have evolved to target more complex data structures (linked lists, trees, graphs, etc.) built from individual structures interconnected by pointers. The focus of this paper is the peeling-plus-indexing optimization, which combines structure peeling with the replacement of pointers by their indices in an array. Using the mcf benchmark, we compare the efficiency of the different types of struct-reorg optimizations. We discuss the features of the GCC infrastructure that were critical for developing these optimizations: the Tree-SSA form, the interprocedural analysis infrastructure (ipa), the call graph, and the ipa-type-escape analysis. Finally, we present the performance results gained by the struct-reorg optimizations.

Rationale for GCC's Link Time Optimization Implementation

Kenneth Zadeck

Link Time Optimization (LTO) is a process in which the compilation of the modules that make up an application is deferred until link time. In a conventional compiler, each module (or source file) is compiled to a single object file, and the object files are then linked together to form the executable for the application. In LTO, each module is still compiled to a single object file, but each object file also contains a high-level representation of the module. The linker combines these high-level representations and passes them back to the compiler, where the entire application can be optimized at once.

LTO makes many transformations profitable that would otherwise not be, because of the pessimistic assumptions that must be made about the parts of the program unseen by a module-at-a-time compilation process.

This paper discusses many of the issues that the GNU Compiler Collection (GCC) must face as it transitions from a module-at-a-time compiler to an LTO compiler.

Split Compilation: an Application to Just-in-Time Vectorization

Piotr Lesnicki, Albert Cohen, Grigori Fursin, Marco Cornero, Andrea Ornstein, and Erven Rohou

In a world of ubiquitous, heterogeneous parallelism, achieving portable performance is a challenge. It requires finely tuned coordination, from the programming language to the hardware, through the compiler and multiple layers of the run-time system. This document presents our work in split compilation and parallelization. Split compilation relies on automatically generated semantic annotations to enrich the intermediate format, decoupling costly offline analyses from lighter, online or just-in-time program transformations.

Our work focuses on automatic vectorization, a key optimization playing an increasing role in modern, power-efficient architectures. Our research platform uses GCC's support for the Common Language Infrastructure (CLI, ECMA-335); this choice is motivated by GCC's unique combination of optimizations and portability, achieved through a semantically rich and performance-friendly intermediate format. The implementation is still in progress.

Register Allocation Techniques for iVMX Architecture

Mircea Namolaru

iVMX is a novel architecture with a large register file accessed via a mapping mechanism. In this paper, we present several local register allocation and mapping algorithms for this architecture, using an FIR filter as an example and benchmark.

We identified a pattern of register ranges that allows efficient register allocation for iVMX. For the FIR filter, we succeeded in significantly reducing the overhead required for mapping.

Graphite: Towards a Declarative Polyhedral Representation

Sebastian Pop

Classical polyhedral representations of imperative languages entangle untranslated scalar imperative operations with the declarative descriptions of the polyhedral model. Such a representation can handle high-level array operations, but it is more difficult to apply to programs that split array computations into smaller chunks involving scalar temporary variables.

The aim of this paper is to provide an overview of the algorithms implemented in GRAPHITE, and to convey the intuitive ideas behind the changes needed to the classical polyhedral representations to accommodate an SSA three-address representation. These changes tend to shift the polyhedral representation toward a purely declarative language.

Augmented GPROF Tool for Runtime Profiling/Tracing of Multithreaded Applications and OS Activity

Sandro Bartolini and Antonio C. Prete

This paper presents an approach for profiling and tracing multithreaded applications with two main objectives: first, to extend the strengths and overcome the limitations of the GPROF tool when used on parallel applications; second, to gather information useful for extending the existing GCC profile-driven optimizations and for investigating new ones for parallel applications. In order to perform an insightful profiling of a multithreaded application, our approach gathers intra-thread and, in addition, inter-thread information. For the latter, operating-system activity, as well as the usage of programmer-level synchronization mechanisms (e.g., semaphores, mutexes), has to be taken into account together with the user-level profiling.

The proposed approach exposes various per-thread information, such as the call graph, and a number of inter-thread measures, such as blocking relationships between threads, blocking time, usage patterns of synchronization mechanisms, and context switches.

The approach introduces a relatively low overhead, which makes it widely applicable: less than 9% on test multithreaded benchmarks and less than a 2.5x slowdown for a real MySQL execution.