DagsterDocs

Solids#

Solids are the functional unit of work in Dagster. A solid's responsibility is to read its inputs, perform an action, and emit outputs. Multiple solids can be connected to create a Pipeline.

Relevant APIs#

NameDescription
@solidThe decorator used to define solids. The decorated function is called the compute_fn. The decorator returns a SolidDefinition
InputDefinitionInputDefinitions define the inputs to a solid compute function. These are defined on the input_defs argument to the @solid decorator
OutputDefinitionOutputDefinitions define the outputs of a solid compute function. These are defined on the output_defs argument to the @solid decorator
SolidDefinitionBase class for solids. You almost never want to use initialize this class directly. Instead, you should use the @solid which returns a SolidDefinition

Overview#

Solids are used to organize related computations. Solids can later be assembled into Pipelines Pipelines. Solids generally perform one specific action and are used for batch computations. For example, you can use a solid to:

  • Execute a database query and store the result
  • Initiate a Spark job in a remote cluster
  • Query an API data and store the result in a data warehouse
  • Send an email or Slack message

Solids have several important properties:

  • Inputs and Outputs: Solids have defined inputs and outputs, which can be optionally typed. These types are validated at runtime.
  • Configurable: Solids can be configured, using a strongly typed configuration system.
  • Dependencies: Solids inputs can depend on the outputs from other solids. A solid will not execute until all of its inputs have been resolved successfully. The dependency structure is defined using a Pipeline.
  • Emit an Event Stream: Solids can emit a stream of structured events, such as These events can be viewed in Dagit, Dagster's UI tool.

Solids are meant to be individually testable and reusable. Dagster provides several APIs that make it easy to create a library of solids that work across test, staging, and production environments that can be re-used across your codebase.


Defining a solid#

To define a solid, use the @solid decorator. The decorated function is called the compute_fn and must have context as the first argument. The context provides access to important properties and objects, such as solid configuration and resources

@solid
def my_solid(context):
    return "hello"

Solid Configuration#

All definitions in dagster expose a config_schema, making them configurable and parameterizable. The configuration system is explained in detail on Config Schema.

Solid definitions can specify a config_schema for the solid's configuration. The configuration is accessible through the solid context at runtime. Therefore, solid configuration can be used to specify solid behavior at runtime, making solids more flexible and reusable.

For example, we can define a solid where the API endpoint it queries is define through it's configuration:

@solid(config_schema={"api_endpoint": str})
def my_configured_solid(context):
    api_endpoint = context.solid_config["api_endpoint"]
    data = requests.get(f"{api_endpoint}/data").json()
    return data

Inputs and Outputs#

Each solid defines a set of inputs and outputs using InputDefinitions and OutputDefinitions. Inputs and outputs are used to define dependencies between solids.

Both definitions have a few important properties:

  • They are named
  • They are optionally typed. These types are validated at runtime.
  • (Advanced) They can be linked to an IOManager. See IOManager for more info.

Inputs#

Inputs are passed as arguments to a solid's compute_fn. They are specified using InputDefinitions. The value of an input can be passed from the output of another solid, or stubbed (hardcoded) using config.

A solid only starts to execute once all of its inputs have been resolved. Inputs can be resolved in two ways:

  • The upstream output that the input depends on has been successfully emmitted and stored.
  • The input was stubbed through config.
# The name is required, but both dagster_type and description are optional.
# - The dagster type will be checked at runtime
# - The description useful for documentation and is displayed in Dagit

InputDefinition(name="abc", dagster_type=str, description="Some description")
InputDefinition(name="xyz", dagster_type=int, description="Some description")

We define input definitions on the @solid decorator. The argument names of the compute_fn must match the InputDefinitions names.

# Inputs abc and xyz must appear in the same order on the compute fn
@solid(
    input_defs=[
        InputDefinition(name="abc", dagster_type=str, description="Some description"),
        InputDefinition(name="xyz", dagster_type=int, description="Some description"),
    ]
)
def my_input_example_solid(context, abc, xyz):
    pass

For simple cases, you can use Python type hints instead of specifying InputDefinitions. However, this will prevent you from being able to set a default value or description for your input.

@solid
def my_typehints_solid(context, abc: str, xyz: int):
    pass

Outputs#

Outputs are yielded from a solid's compute_fn. When you have one output, you can return the output value directly. However, when you have more than one output, you must use yield using the Output class to disambiguate between outputs.

Similar to InputDefinitions, we define OutputDefinitions on the @solid decorator.

@solid(
    input_defs=[
        InputDefinition(name="a", dagster_type=int),
        InputDefinition(name="b", dagster_type=int),
    ],
    output_defs=[
        OutputDefinition(name="sum", dagster_type=int),
        OutputDefinition(name="difference", dagster_type=int),
    ],
)
def my_input_output_example_solid(context, a, b):
    yield Output(a + b, output_name="sum")
    yield Output(a - b, output_name="difference")

Solid Context#

The first parameter of a solids compute_fn is the context object, which is an instance of SystemComputeExecutionContext. The context provides access to:

  • Solid configuration (context.solid_config)
  • Loggers (context.log)
  • Resources (context.resources)
  • The current run ID: (context.run_id)

For example, to access the logger and log a info message:

@solid(config_schema={"name": str})
def context_solid(context):
    name = context.solid_config["name"]
    context.log.info(f"My name is {name}")

Using a solid#

Solids are used within a @pipeline. You can see more information on the Pipelines page. You can also execute a single solid, usually within a test context, using the execute_solid function. More information can be found at Testing Pipelines and Solids

Examples#

With multiple inputs#

Here we have a solid with multiple inputs. Notice how the order of the inputs match the order of the arguments on the compute function.

@solid(
    input_defs=[
        InputDefinition(name="value_a", dagster_type=int),
        InputDefinition(name="value_b", dagster_type=int),
    ]
)
def adder(context, value_a, value_b):
    context.log.info(str(value_a + value_b))

With single outputs#

When you have a single output, you don't need to yield the output using the Output class. This is because there is no ambiguity about which output is being emmited.

@solid(output_defs=[OutputDefinition(name="my_name", dagster_type=str)])
def single_output_solid(_context):
    return "Dagster"

With multiple outputs#

Here we have a solid that emits multiple outputs. Notice how we have to use yield here instead of return since we more than one output. It is also imporant to wrap the output in the Output class, in order to help differentiate different outputs.

@solid(
    output_defs=[
        OutputDefinition(name="my_name", dagster_type=str),
        OutputDefinition(name="age", dagster_type=str),
    ]
)
def multiple_outputs_solid(_context):
    yield Output("dagster", output_name="my_name")
    yield Output("dagster", output_name="age")

Untyped inputs and outputs#

Inputs and outputs are optionally typed. It is okay to leave out the type of inputs or outputs if needed.

@solid(
    input_defs=[
        InputDefinition(name="value_a"),
        InputDefinition(name="value_b"),
    ]
)
def untyped_inputs_solid(context, value_a, value_b):
    context.log.info(str(value_a + value_b))

If you are only using the "name" argument for inputs, you can drop the input definitions entirely.

@solid
def no_input_defs_solid(context, value_a, value_b):
    context.log.info(str(value_a + value_b))

Patterns#

Solid Factory#

You may find the need to create utilities that help generate solids. In most cases, you should parameterize solid behavior by adding solid configuration. You should reach for this pattern if you find yourself needing to vary the arguments to the @solid decorator or SolidDefinition themselves, since they cannot be modified based on solid configuration.

To create a solid factory, you define a function that returns a SolidDefinition, either directly or by decorating a function with the solid dectorator.

def x_solid(
    arg,
    name="default_name",
    input_defs=None,
    **kwargs,
):
    """
    Args:
        args (any): One or more arguments used to generate the nwe solid
        name (str): The name of the new solid.
        input_defs (list[InputDefinition]): Any input definitions for the new solid. Default: None.

    Returns:
        function: The new solid.
    """

    @solid(name=name, input_defs=input_defs or [InputDefinition("start", Nothing)], **kwargs)
    def _x_solid(context):
        # Solid logic here
        pass

    return _x_solid

FAQ#

Why "Solid"?#

Why is a solid called a "solid"? It is a long and meandering journey, from a novel concept, to a familiar acronym, and back to a word.

In a data management system, there are two broad categories of data: source data—meaning the data directly inputted by a user, gathered from an uncontrolled external system, or generated directly by a sensor—and computed data—meaning data that is either created by computing on source data or on other computed data. Management of computed data is the primary concern of Dagster. Another name for computed data would be software-structured data. Or SSD. Given that SSD is already a well-known acronym for Solid State Drives we named our core concept for software-structured data a Solid.