AgPipeline.github.io

Transformers

This document describes the design of transformers and provides details on how we’ve implemented that design.

Each transformer is constructed by combining different conceptual Types, each of which corresponds to distinct files and specific function signatures. After the Overview section below, each Type is presented in its own section, starting with the Entry Point Type.

This documentation emphasises Docker usage of the code with the expectation that the concepts can be transferred to non-Docker contexts. In other words, while we make extensive use of Docker, it’s not required.

Overview

Each transformer is the combination of several conceptual Types that are implemented as specifically-named files containing specifically-named-and-parameterized functions (specific function signatures).

The main motivations for this approach are:

Separate common code from mutable code as much as possible. For example, converting a .bin file to a georeferenced .tif image file doesn’t appear to have much in common with creating an orthomosaic of the georeferenced image files. However, it’s only the specifics of the data transformation that are different, not the flow of control and the runtime environment.
Reduce the cognitive overhead needed to produce a working transformer. By providing a consistent workflow and environment, developers can focus on the task at hand with minimal bookkeeping knowledge, which reduces the number of tasks they need to perform.
Provide stable transformers quickly. Using the same code for the workflow and the request environment means that mistakes are reduced and testing can be more robust. For example, developers of Algorithms can focus on their piece of a transformer knowing that workflow and environmental concerns are already taken care of.
Provide flexibility of runtime environments. Isolating the code that works with a specific runtime environment makes it easier to migrate a transformer to a different environment: implement code for your environment, combine it with the entry point and algorithm code, and a new transformer is created for your runtime environment. You can then combine the proven environment code with any other algorithms you need to create your workflow from existing pieces.

The image below shows a graphical overview of how the transformers are organised conceptually and as repositories or libraries. Each of the columns with a thicker border on the right can be where a transformer is considered complete. (In the diagram below, these are the Algorithm and Plot-level columns.)

The left-most tabs in the above image are explained as follows:

Type - the generic type of code concept/repository each column represents
Info - high level intent statement and one or more example repositories in the AgPipeline GitHub Organization
Details - contains information on expected file names, provided functions/methods, and file dependencies
More - additional information that doesn’t fit in the above rows

Referring to the above image, the minimal set of Types needed for a complete transformer is Entry Point, Environmental, and Algorithm. This can be extended to be more specific by adding the Plot-level Type. Additionally, if the runtime environment requires it, a Transformer can also have an Override Entry Point Type to perform initialization tasks.

The top-most tabs represent a conceptual grouping of the columns they cover. These tabs are explained as follows:

Optional Entry point - this Type is used when the environment a transformer is running in has requirements that can’t be met with the code base implementing the Types. For example, establishing a connection and fetching metadata needed by downstream code
One complete transformer - the minimal set of Types that needs to be implemented to fulfill the requirements for a transformer
Science transformer - to facilitate the development of Scientific algorithms and their incorporation into workflows, it’s desirable to have specialized transformer templates that are geared towards one set of data conditions; handling plot-level RGB data for example

While in most cases code repositories can correspond to one of the columns in the diagram and build upon each other (moving left to right), there are special cases where a repository contains more than one of the shown columns. This is typically due to special cases where the default workflow isn’t sufficient for the task.

Transformer environmental expectations

Transformers have a few runtime environment expectations that are important to note:

workspace (or scratch space): a disk location where files can be created, deleted, or modified as needed
optional request metadata: metadata may be provided that describes the scientific environment the transformer is running in, along with information related to the current request
transformer metadata: for recurrent runs, transformer specific metadata returned from any previous requests
resulting metadata: transformer returned metadata specific to the current request will be correctly handled by the calling process (stored, moved, ignored, etc). Transformers are expected to place their work in the specified workspace and not worry about where the results ultimately end up.
cleanup: the calling process will clean up the environment used by the transformer based upon its knowledge of the request and the results returned by the transformer (cleaning up the workspace, for example). In other words, transformers are not responsible for cleanup.
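
The expectations above can be illustrated with a minimal sketch of a calling process. This is not AgPipeline code: the names run_request and example_transformer, the metadata keys, and the transformer call signature are all assumptions made for illustration. The point is only that the caller owns the workspace, handles the returned metadata, and performs the cleanup.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_request(transformer: Callable[[Path, Optional[dict], Optional[dict]], dict],
                request_md: Optional[dict] = None,
                previous_md: Optional[dict] = None) -> dict:
    """Hypothetical calling process: owns the workspace and its cleanup."""
    workspace = Path(tempfile.mkdtemp(prefix="transformer_"))
    try:
        # The transformer only creates files inside the workspace it is given
        result_md = transformer(workspace, request_md, previous_md)
        # The caller decides what happens to the results (store, move, ignore)
        return result_md
    finally:
        # Cleanup is the caller's job, not the transformer's
        shutil.rmtree(workspace, ignore_errors=True)


def example_transformer(workspace: Path,
                        request_md: Optional[dict],
                        previous_md: Optional[dict]) -> dict:
    """Hypothetical transformer: writes one file and returns its metadata."""
    out_file = workspace / "result.json"
    out_file.write_text(json.dumps({"season": (request_md or {}).get("season")}))
    # Metadata from a previous request is available for recurrent runs
    run_count = (previous_md or {}).get("run_count", 0) + 1
    return {"files": [out_file.name], "run_count": run_count}
```

In this sketch the transformer never deletes anything and never decides where results end up; it only reports what it produced.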

Type Details

This section provides additional information on each of the Types in the diagram shown in the Overview section above. Each of the Type subsections provided here has information on its intent along with other concepts related to that Type.

Entry Point

As the name implies, this is considered the entry point to a transformer. The purpose of the entry point is to provide a common flow of control that each transformer can utilize, and a basic set of command line parameters common to all derived transformers.

Refer to the Entry Point page for a more technical description of this Type.

While the entry point defines the flow of control through a transformer, it doesn’t have an expectation for the data the Environmental or Algorithm Types receive. For example, the default Entry Point in the AgPypeline library’s implementation uses a dict to allow each implementation of the Environmental Type to define its set of parameters to pass to the Algorithm implementation. The contents of this dict are defined by the Environmental and Algorithm code, and not by the Entry Point code.
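
The dict hand-off can be sketched as follows. This is an illustration of the idea rather than the AgPypeline implementation: the function names entry_point, local_params, and count_files, and the request keys, are hypothetical.

```python
from typing import Any, Callable, Dict


def entry_point(get_params: Callable[[dict], Dict[str, Any]],
                process: Callable[..., dict],
                request: dict) -> dict:
    """Hypothetical flow of control shared by every transformer."""
    # The environment code decides which parameters exist...
    params = get_params(request)
    # ...and the entry point forwards them without inspecting their contents
    return process(**params)


def local_params(request: dict) -> Dict[str, Any]:
    """Hypothetical environment code: chooses the keys the algorithm receives."""
    return {"workspace": request["working_space"], "files": request["files"]}


def count_files(workspace: str, files: list) -> dict:
    """Hypothetical algorithm code: its signature matches the dict keys."""
    return {"workspace": workspace, "file_count": len(files)}
```

Because entry_point only unpacks the dict, swapping in a different environment and algorithm pair with a different set of keys would not require any change to the flow of control.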

At times it may be necessary to create a custom entry point for transformers and the AgPypeline library implementation is designed to handle that case. The AgPypeline library implementation is not only designed to be flexible but to provide a consistent experience for developers, maintainers, and anyone else. Please refer to the documentation for detailed information on the implementation of this Type.

The implementation of this Type can then be built into a library or Docker image which can subsequently be used by other Types. For the AgPipeline project we have built this type into the AgPypeline library.

Note that in some cases a preamble needs to happen before the main work can be done. For example, configuring and waiting on a message queue. Refer to the Override Entry Point documentation for more information on preambles.

Environmental

The purpose of the Environmental Type is to interpret the context that a transformer is running in and to provide a consistent interface into that context on a per-request basis. This is accomplished logistically by encapsulating the context in a class and passing an instance of that class to the Algorithm Type.

Refer to the Environment page for a more technical description of this Type.

What this means in practical terms is that transformers for different environments can use the same Entry Point and Algorithm code. This is accomplished by creating different implementations of the Environmental Type. For example, the common_image folder in the ua-gantry-environment repository, or the environment.py file in the AgPypeline library repository.

The AgPypeline Environmental implementation (realized in environment.py) uses metadata and working space values specified on the command line to obtain its request environment, initialize its runtime instance, provide a standard set of data for the Algorithm, and specify a standard set of Algorithm parameters.

This frees the Algorithm code from having any specialized knowledge of its execution environment.
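
A minimal sketch of this idea, assuming a command-line driven environment, is shown here. The class name RequestEnvironment, the function build_environment, and the flag names --working_space and --metadata are assumptions for illustration and are not taken from environment.py.

```python
import argparse
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RequestEnvironment:
    """Hypothetical per-request context handed to the Algorithm."""
    working_space: str
    metadata: dict = field(default_factory=dict)
    files: List[str] = field(default_factory=list)


def build_environment(argv: Optional[List[str]] = None) -> RequestEnvironment:
    """Interpret the command line and return a consistent interface to it."""
    parser = argparse.ArgumentParser(description="Environment sketch")
    # Flag names below are assumptions, not the library's actual flags
    parser.add_argument("--working_space", required=True,
                        help="folder the transformer may write to")
    parser.add_argument("--metadata", default="{}",
                        help="request metadata as a JSON string")
    parser.add_argument("files", nargs="*", help="files to process")
    args = parser.parse_args(argv)
    return RequestEnvironment(args.working_space,
                              json.loads(args.metadata),
                              args.files)
```

An Algorithm that receives a RequestEnvironment instance reads the same attributes whether the values came from a command line, a message queue, or some other source; only build_environment would differ between environments.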

Please refer to the Environmental documentation for detailed information on an implementation of this Type.

Algorithm

Now that the workflow is defined by the Entry Point Type and the environment is standardized by the Environmental Type, the algorithm can focus on processing the data it receives.

Refer to the Algorithm page for a more technical description of this Type.

Algorithms can be everything from scrubbing and standardizing metadata to calculating canopy cover on a plot-level basis. When combined with the workflow implemented by the Entry Point Type and a specific Environmental Type, we have a complete Transformer.

Continuing with the theme of making the creation of transformers as simple as possible, this Type can be specialized to work on a very specific set of data. For the AgPipeline organization, this means plot-level Algorithms for RGB, Lidar, and other data products. Each of these plot-level data types is represented by a distinct Algorithm Type implementation that handles the details of looking for and loading data, understanding the common environment, and other tasks (such as knowing the plot name). The writer of Scientific algorithms that work on the plot level only needs to process the actual data and produce the results.
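
The division of labor described above might look like the following sketch. Everything in it is hypothetical: run_plot_level stands in for a plot-level Algorithm Type, and canopy_cover is a deliberately simplistic stand-in for scientific code (it counts pixels where green is the strongest channel, which is not the method used by any AgPipeline repository).

```python
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue)


def canopy_cover(pixels: List[Pixel]) -> float:
    """Hypothetical scientific code: percent of pixels where green dominates."""
    if not pixels:
        return 0.0
    green = sum(1 for r, g, b in pixels if g > r and g > b)
    return 100.0 * green / len(pixels)


def run_plot_level(plots: Dict[str, List[Pixel]],
                   calculate: Callable[[List[Pixel]], float]) -> List[dict]:
    """Hypothetical plot-level wrapper: tracks plot names and result layout."""
    # The scientific function never sees plot names, paths, or the environment
    return [{"plot": name, "value": calculate(pixels)}
            for name, pixels in plots.items()]
```

The author of canopy_cover only receives pixel data and returns a number; locating the data, associating it with a plot, and shaping the results are left to the wrapper.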

Please refer to the Algorithm documentation for detailed information on this Type.

Plot-level

This Type significantly reduces the overhead associated with creating a transformer. When a developer provides information on what the transformer does and some scientific context, along with the analysis results, a complete Transformer can be created for different environments with minimal work.

The intent here is to remove as much context from the implementation of the analysis as possible, by providing it elsewhere through the Types listed above. This will allow the analysis developer to focus on implementing their piece of the transformer as narrowly as possible, massively reducing the specialized knowledge necessary for developing a full transformer.

With a simple way to write analysis code and a common means of testing it, developers can be confident that their solution meets the minimum requirements and should work as expected.

Please refer to the plot-level RGB documentation for detailed information on this Type as implemented for RGB data.

Override Entry Point

This Type is for situations where the existing flow of control and Type (Environmental, Algorithm, etc.) implementations are otherwise sufficient but can’t fully provide the environment the transformer needs. For example, needing to configure and listen on a message queue before running the rest of the transformer.

There aren’t any specifics behind this Type and the implementation details are up to the developers.
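
As one possible shape for such an override, the sketch below runs a preamble that reads requests from a queue and then hands each one to the unchanged flow of control. It is only an illustration under stated assumptions: Python’s in-process queue.Queue stands in for a real message queue, and default_entry_point and override_entry_point are hypothetical names.

```python
import queue
from typing import Callable, List


def default_entry_point(request: dict) -> dict:
    """Stand-in for the existing, unchanged flow of control."""
    return {"processed": request["id"]}


def override_entry_point(messages: "queue.Queue",
                         run: Callable[[dict], dict] = default_entry_point) -> List[dict]:
    """Hypothetical preamble: drain the queue, then defer to the usual flow."""
    results = []
    while True:
        try:
            # A real override would block or poll on its message broker here
            request = messages.get_nowait()
        except queue.Empty:
            break
        # Everything after the preamble is the regular transformer run
        results.append(run(request))
    return results
```

The override adds behavior before the normal entry point without modifying it, which is why the Environmental and Algorithm code can be reused as-is.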

What is expected in the Docker environment is that the ENTRYPOINT for the image(s) will be changed to point to a new location. One way to create the Docker image might be to use a finished image as the basis for your images, and use the Dockerfile to copy your code to the new image while also specifying the new entry point.
