Solids are the functional unit of work in Dagster. A solid's responsibility is to read its inputs, perform an action, and emit outputs. Multiple solids can be connected to create a Pipeline.
Name | Description |
---|---|
@solid | The decorator used to define solids. The decorated function is called the compute_fn. The decorator returns a SolidDefinition |
InputDefinition | InputDefinitions define the inputs to a solid compute function. These are defined on the input_defs argument to the @solid decorator |
OutputDefinition | OutputDefinitions define the outputs of a solid compute function. These are defined on the output_defs argument to the @solid decorator |
SolidDefinition | Base class for solids. You almost never want to instantiate this class directly. Instead, use the @solid decorator, which returns a SolidDefinition |
Solids are used to organize related computations. Solids can later be assembled into Pipelines. Solids generally perform one specific action and are used for batch computations. For example, you can use a solid to:
Solids have several important properties:
Solids are meant to be individually testable and reusable. Dagster provides several APIs that make it easy to create a library of solids that work across test, staging, and production environments and can be reused across your codebase.
To define a solid, use the @solid decorator. The decorated function is called the compute_fn and must have context as the first argument. The context provides access to important properties and objects, such as solid configuration and resources.
from dagster import solid

@solid
def my_solid(context):
    return "hello"
All definitions in Dagster expose a config_schema, making them configurable and parameterizable. The configuration system is explained in detail on Config Schema.
Solid definitions can specify a config_schema for the solid's configuration. The configuration is accessible through the solid context at runtime. Therefore, solid configuration can be used to specify solid behavior at runtime, making solids more flexible and reusable.
For example, we can define a solid where the API endpoint it queries is defined through its configuration:
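import requests
from dagster import solid

@solid(config_schema={"api_endpoint": str})
def my_configured_solid(context):
    # The endpoint comes from solid configuration rather than being hardcoded
    api_endpoint = context.solid_config["api_endpoint"]
    data = requests.get(f"{api_endpoint}/data").json()
    return data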
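To make the role of config_schema more concrete, here is a rough plain-Python sketch of the kind of check that a schema such as {"api_endpoint": str} implies. It is not Dagster's implementation: check_config is a hypothetical helper, it uses no Dagster imports, and it only handles flat schemas that map field names to Python types.

```python
from typing import Any, Dict


def check_config(schema: Dict[str, type], config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: validate a flat config dict against a name -> type schema."""
    missing = sorted(set(schema) - set(config))
    if missing:
        raise KeyError(f"missing config fields: {missing}")
    unexpected = sorted(set(config) - set(schema))
    if unexpected:
        raise KeyError(f"unexpected config fields: {unexpected}")
    for field, expected_type in schema.items():
        if not isinstance(config[field], expected_type):
            raise TypeError(f"field {field!r} must be of type {expected_type.__name__}")
    return config


schema = {"api_endpoint": str}

# A config that matches the schema is returned unchanged.
valid = check_config(schema, {"api_endpoint": "https://example.com"})

# A config with the wrong type for a field is rejected.
try:
    check_config(schema, {"api_endpoint": 8080})
    rejected = False
except TypeError:
    rejected = True
```

The point of the sketch is that a schema lets bad configuration be rejected before the compute function runs, so the body of the solid can assume the fields it reads are present and well typed.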
Each solid defines a set of inputs and outputs using InputDefinitions and OutputDefinitions. Inputs and outputs are used to define dependencies between solids.
Both definitions have a few important properties. In particular, each can be linked to an IOManager. See IOManager for more info.
Inputs are passed as arguments to a solid's compute_fn. They are specified using InputDefinitions. The value of an input can be passed from the output of another solid, or stubbed (hardcoded) using config.
A solid only starts to execute once all of its inputs have been resolved. Inputs can be resolved in two ways: from the output of an upstream solid, or from a value stubbed through config.
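As a rough picture of the rule that a solid runs only after all of its inputs are available, the following plain-Python sketch executes a tiny graph of steps. Everything in it is hypothetical and unrelated to Dagster's actual engine: run_when_ready, the step names, and the upstream mapping are illustrations only.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_when_ready(
    steps: Dict[str, Callable[..., Any]],
    upstream: Dict[str, List[str]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Hypothetical scheduler: run each step once every upstream result exists."""
    results: Dict[str, Any] = {}
    order: List[str] = []
    pending = set(steps)
    while pending:
        # A step is ready only when all of its inputs have been produced.
        ready = sorted(
            name for name in pending if all(dep in results for dep in upstream[name])
        )
        if not ready:
            raise ValueError("unresolvable inputs (cycle or missing step)")
        for name in ready:
            inputs = [results[dep] for dep in upstream[name]]
            results[name] = steps[name](*inputs)
            order.append(name)
            pending.remove(name)
    return results, order


# Two steps with no inputs feed a third step that needs both of their outputs.
steps = {
    "one": lambda: 1,
    "two": lambda: 2,
    "add": lambda a, b: a + b,
}
upstream = {"one": [], "two": [], "add": ["one", "two"]}
results, order = run_when_ready(steps, upstream)
```

In this sketch the add step cannot start until both one and two have produced a value, which mirrors how inputs define the dependencies between solids.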
from dagster import InputDefinition

# The name is required, but both dagster_type and description are optional.
# - The dagster type will be checked at runtime
# - The description is useful for documentation and is displayed in Dagit
InputDefinition(name="abc", dagster_type=str, description="Some description")
InputDefinition(name="xyz", dagster_type=int, description="Some description")
We define input definitions on the @solid decorator. The argument names of the compute_fn must match the InputDefinitions names.
from dagster import InputDefinition, solid

# Inputs abc and xyz must appear in the same order on the compute fn
@solid(
    input_defs=[
        InputDefinition(name="abc", dagster_type=str, description="Some description"),
        InputDefinition(name="xyz", dagster_type=int, description="Some description"),
    ]
)
def my_input_example_solid(context, abc, xyz):
    pass
For simple cases, you can use Python type hints instead of specifying InputDefinitions. However, this will prevent you from being able to set a default value or description for your input.
from dagster import solid

@solid
def my_typehints_solid(context, abc: str, xyz: int):
    pass
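To illustrate how input names and types can be recovered from a function signature, here is a small standard-library sketch. infer_inputs is a hypothetical helper and is not Dagster's inference code. It only assumes, as described above, that the first parameter is reserved for the context.

```python
import inspect
from typing import Any, Callable, List, Tuple, get_type_hints


def infer_inputs(compute_fn: Callable[..., Any]) -> List[Tuple[str, Any]]:
    """Hypothetical helper: derive (name, type) pairs from a function signature."""
    hints = get_type_hints(compute_fn)
    params = list(inspect.signature(compute_fn).parameters)
    # Skip the first parameter, which is reserved for the context object.
    return [(name, hints.get(name, Any)) for name in params[1:]]


def my_typehints_fn(context, abc: str, xyz: int):
    pass


inferred = infer_inputs(my_typehints_fn)
```

Here inferred pairs abc with str and xyz with int. A bare signature carries only a name and a type, which is why a description still requires an explicit InputDefinition.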
Outputs are yielded from a solid's compute_fn. When you have one output, you can return the output value directly. However, when you have more than one output, you must yield each value wrapped in the Output class to disambiguate between outputs.
Similar to InputDefinitions, we define OutputDefinitions on the @solid decorator.
from dagster import InputDefinition, Output, OutputDefinition, solid

@solid(
    input_defs=[
        InputDefinition(name="a", dagster_type=int),
        InputDefinition(name="b", dagster_type=int),
    ],
    output_defs=[
        OutputDefinition(name="sum", dagster_type=int),
        OutputDefinition(name="difference", dagster_type=int),
    ],
)
def my_input_output_example_solid(context, a, b):
    # Each yielded Output names the output definition it belongs to
    yield Output(a + b, output_name="sum")
    yield Output(a - b, output_name="difference")
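To see why naming matters once a compute function yields several values, the following plain-Python sketch mimics the idea without importing Dagster. NamedOutput and collect_outputs are hypothetical stand-ins. They are not Dagster's Output class or its execution engine.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator


@dataclass
class NamedOutput:
    """Hypothetical stand-in for a value tagged with an output name."""

    value: Any
    output_name: str


def compute(a: int, b: int) -> Iterator[NamedOutput]:
    # Each yielded value carries the name of the output it belongs to.
    yield NamedOutput(a + b, output_name="sum")
    yield NamedOutput(a - b, output_name="difference")


def collect_outputs(outputs: Iterable[NamedOutput]) -> Dict[str, Any]:
    """Route each yielded value to its output by name."""
    collected: Dict[str, Any] = {}
    for out in outputs:
        if out.output_name in collected:
            raise ValueError(f"output {out.output_name!r} was yielded twice")
        collected[out.output_name] = out.value
    return collected


outputs_by_name = collect_outputs(compute(5, 3))
```

Without the name attached to each value, the consumer of the generator would have no reliable way to tell which yielded value is the sum and which is the difference.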
The first parameter of a solid's compute_fn is the context object, which is an instance of SystemComputeExecutionContext. The context provides access to solid configuration (context.solid_config), loggers (context.log), resources (context.resources), and the current run ID (context.run_id).
For example, to access the logger and log an info message:
from dagster import solid

@solid(config_schema={"name": str})
def context_solid(context):
    name = context.solid_config["name"]
    context.log.info(f"My name is {name}")
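Because a compute function reaches its configuration and logger only through attributes of the context, its logic can be tried out against a simple stand-in object. The sketch below uses only the standard library. The describe function and fake_context are hypothetical, and the stand-in is not how Dagster constructs the real context.

```python
import logging
from types import SimpleNamespace


def describe(context) -> str:
    """Plain function that reads the same context attributes a compute_fn would."""
    name = context.solid_config["name"]
    message = f"My name is {name}"
    context.log.info(message)
    return message


# Hypothetical stand-in for the context; Dagster builds the real one at runtime.
fake_context = SimpleNamespace(
    solid_config={"name": "dagster"},
    log=logging.getLogger("context_sketch"),
    run_id="hypothetical-run-id",
)
message = describe(fake_context)
```

This only demonstrates the shape of the context object. For real tests, prefer the testing APIs that Dagster itself provides.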
Solids are used within a @pipeline. You can see more information on the Pipelines page. You can also execute a single solid, usually within a test context, using the execute_solid function. More information can be found at Testing Pipelines and Solids.
Here we have a solid with multiple inputs. Notice how the order of the inputs matches the order of the arguments on the compute function.
from dagster import InputDefinition, solid

@solid(
    input_defs=[
        InputDefinition(name="value_a", dagster_type=int),
        InputDefinition(name="value_b", dagster_type=int),
    ]
)
def adder(context, value_a, value_b):
    context.log.info(str(value_a + value_b))
When you have a single output, you don't need to yield the output using the Output class. This is because there is no ambiguity about which output is being emitted.
from dagster import OutputDefinition, solid

@solid(output_defs=[OutputDefinition(name="my_name", dagster_type=str)])
def single_output_solid(_context):
    return "Dagster"
Here we have a solid that emits multiple outputs. Notice how we have to use yield here instead of return since we have more than one output. It is also important to wrap each output in the Output class, in order to differentiate between outputs.
from dagster import Output, OutputDefinition, solid

@solid(
    output_defs=[
        OutputDefinition(name="my_name", dagster_type=str),
        OutputDefinition(name="age", dagster_type=str),
    ]
)
def multiple_outputs_solid(_context):
    yield Output("dagster", output_name="my_name")
    yield Output("dagster", output_name="age")
Inputs and outputs are optionally typed. It is okay to leave out the type of inputs or outputs if needed.
from dagster import InputDefinition, solid

@solid(
    input_defs=[
        InputDefinition(name="value_a"),
        InputDefinition(name="value_b"),
    ]
)
def untyped_inputs_solid(context, value_a, value_b):
    context.log.info(str(value_a + value_b))
If you are only using the "name" argument for inputs, you can drop the input definitions entirely.
from dagster import solid

@solid
def no_input_defs_solid(context, value_a, value_b):
    context.log.info(str(value_a + value_b))
You may find the need to create utilities that help generate solids. In most cases, you should parameterize solid behavior by adding solid configuration. You should reach for a solid factory only if you need to vary the arguments to the @solid decorator or SolidDefinition themselves, since they cannot be modified based on solid configuration.
To create a solid factory, you define a function that returns a SolidDefinition, either directly or by decorating a function with the solid decorator.
from dagster import InputDefinition, Nothing, solid

def x_solid(
    arg,
    name="default_name",
    input_defs=None,
    **kwargs,
):
    """
    Args:
        arg (any): One or more arguments used to generate the new solid.
        name (str): The name of the new solid.
        input_defs (list[InputDefinition]): Any input definitions for the new solid. Default: None.
    Returns:
        SolidDefinition: The new solid.
    """

    @solid(name=name, input_defs=input_defs or [InputDefinition("start", Nothing)], **kwargs)
    def _x_solid(context):
        # Solid logic here
        pass

    return _x_solid
Why is a solid called a "solid"? It is a long and meandering journey, from a novel concept, to a familiar acronym, and back to a word.
In a data management system, there are two broad categories of data. Source data is data directly input by a user, gathered from an uncontrolled external system, or generated directly by a sensor. Computed data is data created by computing on source data or on other computed data. Management of computed data is the primary concern of Dagster. Another name for computed data would be software-structured data, or SSD. Given that SSD is already a well-known acronym for Solid State Drives, we named our core concept for software-structured data a Solid.