GXL was created to fulfill the need to exchange data between reengineering tools. Previously, interoperability between tools relied on converters between local formats. This approach requires case-by-case negotiation of exchange semantics. As the research area matured, it became apparent that a standard exchange format was needed and that this format should provide a mechanism to help articulate semantics.
These experiences with interoperability and local file formats form the context for the development of GXL. Moreover, they circumscribe the requirements and criteria for success for a standard exchange format. In this section, we will describe this background and how it informed the emergence of GXL.
2.1 Interoperability of graph-based tools
A variety of reengineering tools employ graphs as an internal data representation. With improved data interoperability, reengineering workbenches can be composed by choosing the best component for a particular task. A typical reengineering workbench consists of three types of tools: Extractors, Abstractors, and Visualizers.
Extractors. These tools extract information from software artifacts, such as source code. Examples of such extractors for the C/C++ source languages are ACACIA, CPPX, and Columbus/CAN. ASIS (Ada Semantic Interface Specification) offers similar functionality.
Abstractors. These components of reengineering workbenches analyze the extracted data, generating further information and sometimes changing the form of the data. Tools of particular interest here treat the data as graphs; these include the PROGRES graph transformation system and GUPRO. General graph-based query mechanisms such as Grok and GReQL are used for analyzing graph-structured data. RPA uses a relational approach to analyzing software systems. Further specialized abstractors have been developed for architectural analysis and recovery; for control flow, data flow, and dependency analysis; and for software metrics.
Visualizers. These tools display the information derived in the previous steps, either textually or graphically. Source code browsers are typical textual visualizers. Graphical visualizers have been used to display class diagrams, sequence diagrams, and statecharts. General graph drawing tools, for instance daVinci, Graphlet, and GraphViz, have been used to visualize small- and medium-sized graphs. Large, complex graphs are better handled by visualization tools designed for reengineering, such as Rigi and Shrimp.
2.2 Collaborating Toolsets
The approaches and tools described above provide good support for various aspects of reengineering. Individual tools from different workbenches have been combined to tackle a range of reengineering challenges. Here are some illustrative examples of these data exchanges.
Acacia C++ parser tool kit and graph drawing, by AT&T Labs. A command line interface allows extraction of facts from Acacia's database of facts about a parsed C++ program. The analysis is at the external declaration level. These facts were converted into a corresponding TA stream and used to analyze the Mozilla source code with PBS tools.
Dali reverse engineering tool kit, by the Software Engineering Institute. This tool kit combines features from a number of tools. To analyze the Linux kernel, SNiFF+'s API was used to extract facts about Linux. These facts were stored in TA and analyzed using a relational database, and the results were viewed using Rigi.
CPPAnal by Harry Sneed and GUPRO from the University of Koblenz. The CPPAnal tools extract architectural-level information from the source code of large software systems and store it in SQL tables. Using GraX, these tables are transferred into TGraphs for further analysis with GUPRO tools. The G2HTML filter translates TGraphs into HTML documents, which are used in GUPRO to browse large graphs.
All of these collaborations were made possible through converters that take files in one local format and transform the data into another local format. While this approach has been used successfully, it does not scale well: a converter must be written for each pair of local data formats, so the number of converters grows quadratically with the number of tools and the effort quickly becomes unmanageable. A standard exchange format serves as an intermediary among these file formats; tool developers need only convert between their local format and the exchange format. Individual tools may still use their own internal representations, including internal databases and in-memory data structures, for analyzing and transforming data.
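The scaling argument can be made concrete with a little arithmetic: with n tools, each using its own local format, direct conversion requires one converter per ordered pair of formats, while a hub-and-spoke scheme built on a standard exchange format requires only one importer and one exporter per tool. A small sketch:

```python
# Sketch: converter counts for n tools with distinct local formats.

def pairwise_converters(n: int) -> int:
    # Direct conversion: one converter per ordered pair of formats.
    return n * (n - 1)

def hub_converters(n: int) -> int:
    # Standard exchange format: one importer and one exporter per tool.
    return 2 * n

for n in (3, 5, 10):
    print(n, pairwise_converters(n), hub_converters(n))
```

Already at ten tools, the pairwise approach needs 90 converters against 20 for the hub scheme, and the gap widens quadratically from there.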
2.3 Requirements for a standard exchange format
Examination of the collaborations in the previous subsection and those we have reported elsewhere provides insights into the requirements for a standard exchange format. Many of the file formats used in reengineering represent graphs or are compatible with graphs. Consequently, we also examined formats from graph drawing and graph transformation, and combined the best features from these formats. We decided to create a standard exchange format that was graph-based, because graphs have a well-understood mathematical foundation. By building on this structural elegance, our format can represent data from a range of domains with disparate structures.
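To illustrate why a typed, attributed graph model accommodates such disparate data, the following sketch shows one possible in-memory representation. The class and attribute names here are illustrative assumptions, not GXL's actual schema; the point is that the same structure can hold fine-grained facts (e.g. a call graph) or coarse architectural views simply by varying node and edge types.

```python
from dataclasses import dataclass, field

# Hypothetical minimal model of a typed, attributed, directed graph.
# Names are illustrative only and do not reflect GXL's actual schema.

@dataclass
class Node:
    id: str
    type: str
    attrs: dict = field(default_factory=dict)

@dataclass
class Edge:
    source: str
    target: str
    type: str
    attrs: dict = field(default_factory=dict)

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_edge(self, edge: Edge) -> None:
        self.edges.append(edge)

# The same structure serves different abstraction levels; here it holds
# a call-graph fact as an extractor might produce it from source code.
g = Graph()
g.add_node(Node("f1", "Function", {"name": "main", "line": 10}))
g.add_node(Node("f2", "Function", {"name": "parse", "line": 42}))
g.add_edge(Edge("f1", "f2", "calls"))
print(len(g.nodes), len(g.edges))
```

Swapping the node type to, say, "Subsystem" and the edge type to "contains" would model an architectural view with no change to the structure itself.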
Other authors who have examined data interoperability have identified the following criteria for the success of a standard exchange format:
I. works for several purposes (e.g. various levels of abstraction, several languages),
II. works for multi-million lines of code (e.g. 3 to 10 MLOC),
III. maps between entities/relationships and source code statements (e.g. line number or AST node),
IV. is incremental (e.g. one subsystem at a time can be added),
V. is universal (e.g. used by others), and
VI. is extensible.
Criterion I reflects the amount of variability in the data exchanged. There are differences in the data model used to represent software, level of abstraction, and the source language input to extractors. A standard exchange format would need to be flexible enough to be an intermediary in these and other situations. Criterion II is essentially scalability. A software system can be quite large, so fine-grained representations, such as an abstract syntax tree, can have millions of nodes and edges. Criteria III and IV give specific requirements and are derived from the software analysis problem domain. The final two criteria relate to adoption. The success of a standard exchange format is measured by the number of users, rather than technical elegance. Therefore, the more people who use the format, the more successful it is and the greater the benefits from using the format. Finally, having a format that is extensible increases its flexibility and its appeal to users from other problem domains who may have additional requirements.
We decided to create a new format rather than use an existing one because we needed a format that i) was simultaneously compatible with as many existing formats as possible, ii) had all and only the necessary graph features, iii) was flexible enough to work with disparate data and different levels of abstraction, and iv) was simple. We further decided to rely on XML because it allowed us to
define our own structure, while at the same time providing infrastructure for constructing import and export tools. One repercussion of this decision is the size of the files being exchanged. These files will be larger due to XML syntax and the length of tags and parameter names. However, this is a problem faced by all XML users. Moreover, standard compression techniques are effective due to the amount of repetition in the files.
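The effectiveness of compression on this kind of file is easy to demonstrate. The fragment below is a hypothetical GXL-like document (illustrative element names, not a validated GXL instance) whose repeated tag and attribute names mimic the repetition in real exchange files; gzip from Python's standard library typically shrinks it by a large factor.

```python
import gzip

# Hypothetical GXL-like fragment: many nodes and edges with repeated
# tag and attribute names, mimicking real exchange-file repetition.
nodes = "\n".join(f'  <node id="n{i}"/>' for i in range(1000))
edges = "\n".join(f'  <edge from="n{i}" to="n{i + 1}"/>' for i in range(999))
doc = f'<gxl>\n<graph id="example">\n{nodes}\n{edges}\n</graph>\n</gxl>\n'

raw = doc.encode("utf-8")
packed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
print(f"ratio: {len(raw) / len(packed):.1f}x")
```

Because the tag names and attribute names recur on every line, dictionary-based compressors such as gzip reclaim most of the verbosity that XML syntax adds.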
2.4 Genealogy of GXL
Development of GXL began with a merger of the GRAph eXchange format (GraX), the Tuple Attribute Language (TA), and the file format of the PROGRES graph rewriting system. Features were added to this initial graph model to handle hierarchical graphs and hypergraphs. Other ideas were taken from data formats used in reengineering (ATerms, RPA, RSF), graph drawing (daVinci, GML/Graphlet, GRL, XGMML, GraphXML), and graph transformation. Consequently, GXL can be viewed as a generalization of these formats. The genealogy of GXL is depicted in Figure 1.