Build multi-step applications and AI workflows with AWS Lambda durable functions

Modern applications increasingly require complex, long-running coordination between services, such as multi-step payment processing, AI agent orchestration, or approval processes that wait for human decisions. Building these has traditionally required significant effort to implement state management, error handling, and the integration of multiple infrastructure services.

Starting today, you can use AWS Lambda durable functions to build reliable multi-step applications directly within the familiar AWS Lambda experience. Durable functions are regular Lambda functions with the same event handlers and integrations you already know. You write sequential code in your preferred programming language, and durable functions track progress, automatically retry on failures, and suspend execution for up to one year at defined points, without paying for idle compute while you wait.

AWS Lambda durable functions use a checkpoint and replay mechanism, known as durable execution, to provide these capabilities. After enabling a function for durable execution, you add the new open source durable execution SDK to your function code. You then use SDK primitives like “steps” to add automatic checkpointing and retries to your business logic, and “waits” to efficiently suspend execution without compute charges. When execution terminates unexpectedly, Lambda resumes from the last checkpoint by replaying your event handler from the beginning and skipping the completed operations.
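To make the checkpoint and replay idea concrete, here is a minimal sketch in plain Python. It is a toy model, not the durable execution SDK: `ToyDurableContext`, the `store` dictionary, and the two step functions are hypothetical names I made up for illustration, and the real service persists checkpoints for you.

```python
class ToyDurableContext:
    """Toy stand-in for a durable context. NOT the real SDK."""

    def __init__(self, checkpoints):
        self.checkpoints = checkpoints  # assumed to survive across invocations
        self._position = 0              # index of the next operation in this run

    def step(self, func):
        key = self._position
        self._position += 1
        if key in self.checkpoints:     # completed in an earlier run: skip it
            return self.checkpoints[key]
        result = func()
        self.checkpoints[key] = result  # checkpoint the result before moving on
        return result


calls = []  # records which step bodies actually ran


def reserve_stock():
    calls.append("reserve_stock")
    return "reserved"


def charge_card():
    calls.append("charge_card")
    if calls.count("charge_card") == 1:
        raise RuntimeError("simulated crash")  # first attempt dies
    return "charged"


def handler(context):
    return [context.step(reserve_stock), context.step(charge_card)]


store = {}  # hypothetical durable checkpoint storage
try:
    handler(ToyDurableContext(store))  # first invocation fails mid-workflow
except RuntimeError:
    pass

result = handler(ToyDurableContext(store))  # replay from the beginning
print(result, calls)
```

On the second invocation the handler runs from the top, but `reserve_stock` is not executed again because its result is already checkpointed. Only the step that never completed runs a second time.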

Getting started with AWS Lambda durable functions

Let me walk you through how to use durable functions.

First, I create a new Lambda function in the console and select Author from scratch. In the Durable execution section, I choose Enable. Note that the durable function setting can only be set during function creation and currently can't be modified for existing Lambda functions.

After I create my Lambda durable function, I can get started with the provided code.

Lambda durable functions introduce two core primitives that handle state management and recovery:

Steps – The context.step() method adds automatic retries and checkpointing to your business logic. After a step is completed, it is skipped during replay.
Wait – The context.wait() method pauses execution for a specified duration, terminating the function, then suspending and resuming execution without compute charges.

Lambda durable functions also provide other operations for more complex patterns: create_callback() creates a callback that you can use to await results from external events such as API responses or human approvals, wait_for_condition() pauses until a specific condition is met, such as polling a REST API for process completion, and the parallel() and map() operations support advanced concurrency use cases.
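The suspend and resume behavior behind waits and callbacks can be sketched the same way. This is a simplified model under my own assumptions, not the SDK: `Suspended`, `ToyContext`, and `wait_for_callback` are hypothetical names, and the `state` dictionary stands in for storage that the service would manage.

```python
class Suspended(Exception):
    """Raised by the toy context to end the invocation while waiting."""


class ToyContext:
    def __init__(self, state):
        self.state = state  # hypothetical storage that outlives one invocation

    def wait_for_callback(self, name):
        if name not in self.state:
            raise Suspended(name)  # nothing to do yet: stop running entirely
        return self.state[name]    # result arrived: continue past the wait


def handler(context):
    decision = context.wait_for_callback("approval")
    return f"order {decision}"


state = {}
try:
    handler(ToyContext(state))
    outcome = "finished"
except Suspended:
    outcome = "suspended"  # no code is running while the approval is pending

state["approval"] = "approved"      # an external system delivers the result
final = handler(ToyContext(state))  # a fresh invocation replays the handler
print(outcome, final)
```

The point of the sketch is that nothing is blocked in memory while waiting. The first invocation ends, and a later invocation replays the handler and moves past the wait once the result exists.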

Building a production-ready order processing workflow

Now let's extend the default example to build a production-ready order processing workflow. It shows how to use callbacks for external approval, handle errors properly, and configure retry strategies. I keep the code intentionally concise to focus on these core concepts. In a full implementation, you could enhance the validation step with Amazon Bedrock to add AI-powered order analysis.

The order processing workflow works like this:

First, validate_order() checks the order data to ensure that all required fields are present.
Next, send_for_approval() sends the order for external human approval and waits for the callback response, suspending execution without compute charges.
Then, process_order() completes order processing.
Try-except error handling throughout the workflow distinguishes between terminal errors, which stop execution immediately, and recoverable errors inside steps, which trigger automatic retries.

Here is the complete order processing workflow with step definitions and main handler:

import random
from aws_durable_execution_sdk_python import (
    DurableContext,
    StepContext,
    durable_execution,
    durable_step,
)
from aws_durable_execution_sdk_python.config import (
    Duration,
    StepConfig,
    CallbackConfig,
)
from aws_durable_execution_sdk_python.retries import (
    RetryStrategyConfig,
    create_retry_strategy,
)


@durable_step
def validate_order(step_context: StepContext, order_id: str) -> dict:
    """Validates order data using AI."""
    step_context.logger.info(f"Validating order: {order_id}")
    # In production: calls Amazon Bedrock to validate order completeness and accuracy
    return {"order_id": order_id, "status": "validated"}


@durable_step
def send_for_approval(step_context: StepContext, callback_id: str, order_id: str) -> dict:
    """Sends order for approval using the provided callback token."""
    step_context.logger.info(f"Sending order {order_id} for approval with callback_id: {callback_id}")
    
    # In production: send callback_id to external approval system
    # The external system will call Lambda SendDurableExecutionCallbackSuccess or
    # SendDurableExecutionCallbackFailure APIs with this callback_id when approval is complete
    
    return {
        "order_id": order_id,
        "callback_id": callback_id,
        "status": "sent_for_approval"
    }


@durable_step
def process_order(step_context: StepContext, order_id: str) -> dict:
    """Processes the order with retry logic for transient failures."""
    step_context.logger.info(f"Processing order: {order_id}")
    # Simulate flaky API that sometimes fails
    if random.random() > 0.4:
        step_context.logger.info("Processing failed, will retry")
        raise Exception("Processing failed")
    return {
        "order_id": order_id,
        "status": "processed",
        "timestamp": "2025-11-27T10:00:00Z",
    }


@durable_execution
def lambda_handler(event: dict, context: DurableContext) -> dict:
    try:
        order_id = event.get("order_id")
        
        # Step 1: Validate the order
        validated = context.step(validate_order(order_id))
        if validated["status"] != "validated":
            raise Exception("Validation failed")  # Terminal error - stops execution
        context.logger.info(f"Order validated: {validated}")
        
        # Step 2: Create callback
        callback = context.create_callback(
            name="awaiting-approval",
            config=CallbackConfig(timeout=Duration.from_minutes(3))
        )
        context.logger.info(f"Created callback with id: {callback.callback_id}")
        
        # Step 3: Send for approval with the callback_id
        approval_request = context.step(send_for_approval(callback.callback_id, order_id))
        context.logger.info(f"Approval request sent: {approval_request}")
        
        # Step 4: Wait for the callback result
        # This blocks until external system calls SendDurableExecutionCallbackSuccess or SendDurableExecutionCallbackFailure
        approval_result = callback.result()
        context.logger.info(f"Approval received: {approval_result}")
        
        # Step 5: Process the order with custom retry strategy
        retry_config = RetryStrategyConfig(max_attempts=3, backoff_rate=2.0)
        processed = context.step(
            process_order(order_id),
            config=StepConfig(retry_strategy=create_retry_strategy(retry_config)),
        )
        if processed["status"] != "processed":
            raise Exception("Processing failed")  # Terminal error
        
        context.logger.info(f"Order successfully processed: {processed}")
        return processed
        
    except Exception as error:
        context.logger.error(f"Error processing order: {error}")
        raise error  # Re-raise to fail the execution

This code demonstrates several important concepts:

Error handling – The try-except block handles terminal errors. When an unhandled exception is raised outside a step (such as the validation check), it terminates the execution immediately. This is useful when there is no point in retrying, such as when the order data is invalid.
Step retries – Inside a step, exceptions trigger automatic retries based on the default strategy (as in Step 1) or a configured RetryStrategy (as in Step 5, the process_order step). This handles transient failures such as temporary API unavailability.
Logging – I use context.logger in the main handler and step_context.logger inside steps. The context logger suppresses duplicate logs during replay.
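To illustrate what a retry strategy with max_attempts=3 and backoff_rate=2.0 could look like, here is a standalone sketch of retries with exponential backoff. It doesn't use the SDK and doesn't claim to match its internals: `run_with_retries` and the `initial_delay` parameter are my own assumptions, and the delays are recorded rather than slept so the example runs instantly.

```python
def run_with_retries(func, max_attempts=3, backoff_rate=2.0, initial_delay=1.0):
    """Call func up to max_attempts times; return (result, delays_between_attempts)."""
    delays = []
    delay = initial_delay  # hypothetical starting delay, in seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return func(), delays
        except Exception:
            if attempt == max_attempts:
                raise              # attempts exhausted: surface the error
            delays.append(delay)   # a real runtime would wait this long here
            delay *= backoff_rate  # each retry waits longer than the last


attempts = []


def flaky():
    """Fails twice, then succeeds, like a briefly unavailable API."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "processed"


result, delays = run_with_retries(flaky)
print(result, delays, len(attempts))
```

With these assumed values, the function succeeds on the third attempt after waiting 1 second and then 2 seconds. A fourth failure would have been re-raised to the caller.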

Now I create a test event with an order_id and invoke the function asynchronously to start the order workflow. I navigate to the Test tab and fill in the optional Durable execution name to identify this execution. Note that durable functions provide built-in idempotency. If I invoke the function twice with the same execution name, the second invocation returns the existing execution result instead of creating a duplicate.
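The idempotency behavior can be pictured as a lookup keyed by execution name. The sketch below is only a mental model with hypothetical names (`executions`, `start_execution`, and the `order-123-run` name). It says nothing about how Lambda implements this internally.

```python
executions = {}  # hypothetical store: execution name -> result
runs = []        # records how many times the workflow body really ran


def start_execution(name, event):
    if name in executions:  # same name seen before: return the existing result
        return executions[name]
    runs.append(name)
    result = {"order_id": event["order_id"], "status": "processed"}
    executions[name] = result
    return result


first = start_execution("order-123-run", {"order_id": "123"})
second = start_execution("order-123-run", {"order_id": "123"})
print(first == second, len(runs))
```

The second call with the same name does not run the workflow again, which is what makes retried invocations safe.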

I can follow the execution by going to the Durable executions tab in the Lambda console.

Here I can see the status and timing of each step. The execution shows CallbackStarted followed by InvocationCompleted, meaning the function has terminated and execution is suspended to avoid idle charges while waiting for the approval callback.

I can now complete the callback directly from the console by choosing Send success or Send failure, or programmatically using the Lambda API.

I choose Send success.

After the callback completes, execution resumes and processes the order. If the process_order step fails because of the simulated flaky API, it automatically retries based on the configured strategy. Once an attempt succeeds, the execution completes successfully.

Monitoring executions with Amazon EventBridge

You can also monitor durable function executions using Amazon EventBridge. Lambda automatically sends execution status change events to the default event bus, so you can build downstream workflows, send notifications, or integrate with other AWS services.

To receive these events, create an EventBridge rule on the default event bus with this pattern:

{
  "source": ["aws.lambda"],
  "detail-type": ["Durable Execution Status Change"]
}
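To see which events a pattern like this selects, here is a small sketch that applies only the exact-match part of EventBridge pattern semantics in plain Python. The `matches` helper is my own simplification, and the `detail` payload in the sample event is hypothetical. Real events carry more fields, and real patterns support more operators.

```python
import json

# The rule pattern, with each field listing its allowed values.
pattern = json.loads("""
{
  "source": ["aws.lambda"],
  "detail-type": ["Durable Execution Status Change"]
}
""")


def matches(pattern, event):
    """Exact-match subset: every pattern key must exist with a listed value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


durable_event = {
    "source": "aws.lambda",
    "detail-type": "Durable Execution Status Change",
    "detail": {"status": "SUCCEEDED"},  # hypothetical payload, for illustration
}
other_event = {"source": "aws.s3", "detail-type": "Object Created"}

print(matches(pattern, durable_event), matches(pattern, other_event))
```

Fields that the pattern does not mention, such as `detail` here, are ignored, so the rule matches every status change regardless of its payload.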

Things you should know

Here are the key points to note:

Availability – Lambda durable functions are now available in the US East (Ohio) AWS Region. See the AWS Capabilities by Region page for the latest Region availability.
Programming language support – AWS Lambda durable functions support the JavaScript/TypeScript (Node.js 22/24) and Python (3.13/3.14) runtimes. We recommend that you bundle the durable execution SDK with your function code using your preferred package manager. The SDKs evolve quickly, so bundling lets you easily update dependencies as new features become available.
Using Lambda versions – When deploying durable functions to production, use Lambda versions to ensure that replay always happens on the same code version. If you update your function code while an execution is suspended, the replay uses the version that started the execution, preventing inconsistencies caused by code changes during long-running workflows.
Testing your durable functions – You can test durable functions locally without AWS credentials using a separate testing SDK with pytest integration, and use the AWS Serverless Application Model (AWS SAM) command-line interface (CLI) for more comprehensive integration testing.
Open source SDKs – The durable execution SDKs are open source for JavaScript/TypeScript and Python. You can view the source code, contribute improvements, and stay informed about the latest features.
Pricing – See the AWS Lambda Pricing page for more information about pricing for AWS Lambda durable functions.

Get started with AWS Lambda durable functions by visiting the AWS Lambda console. For more information, see the AWS Lambda durable functions documentation page.

Happy building!

— Donnie
