Idempotent Event Processing in Cloud Run and Pub/Sub: Handling Duplicate Messages

This is the second article of a series. See the footer for the links to the other articles.

This article explains how to handle duplicate events in Google Cloud Pub/Sub when using Cloud Run consumers. Because Pub/Sub provides at-least-once delivery guarantees, duplicate messages are expected in production systems. We explore how to design idempotent event processing pipelines, including where to enforce deduplication: at the application layer, in stateful caches, or in downstream systems like BigQuery. The focus is on practical engineering tradeoffs for building reliable, stateless event-driven systems.

What is idempotency?
Should your Pipeline be Idempotent
Application-Layer Deduplication vs. Database-Layer Deduplication
Designing a Stateless Cloud Run Consumer for AI Workloads

What is Idempotency?

In distributed systems, an operation is considered idempotent if executing it multiple times produces the exact same system state and outcome as executing it a single time.

In the context of an event-driven data pipeline, making an operation idempotent means that if the same event is sent to your processing service five times, the system will not create five separate database records, charge a customer five times, or invoke a downstream AI model five times. The first request mutates the system state; the subsequent four requests recognize the duplicate entry and do not change the state of the system.

Idempotency is not an innate feature of cloud infrastructure, it must be intentionally built into the architecture considering constraints and tradeoffs between different solutions.

Should your Pipeline be Idempotent?

Yes, idempotence is a critical property for ensuring data integrity across distributed systems. We want to prevent critical errors like double-counting revenue, spend, or user engagement metrics, ensuring deterministic outcomes despite infrastructure failures.

Since many event-driven systems are built on managed queues such as Google Cloud Pub/Sub which offer at-least-once delivery guarantees, duplicate delivery is possible.

Several common system behaviors produce duplicates even when no failures are visible at the application layer:

Network retries: transient gRPC or HTTP failures can cause the same publish event to be retried by upstream services.
Worker restarts: Cloud Run instances may be terminated and replaced during scaling events or deployments while still processing in-flight messages or they may simply crash during execution.
Acknowledgement timing variance: messages processed near the acknowledgment deadline may be redelivered if acknowledgment propagation is delayed.
Horizontal scaling race conditions: multiple consumers can temporarily receive the same unacked message during scaling transitions.

Individually, each of these mechanisms is designed for resilience. Collectively, you get a system where duplicate messages are unavoidable and the system must accommodate for them.

Application-Layer Deduplication vs. Database-Layer Deduplication

Focusing only on the application layer to make a pipeline idempotent can be an expensive strategy. Maintaining distributed state (e.g., via a Redis-backed deduplication filter) at the ingestion layer increases costs, network overhead, and complexity, yet still leaves the system exposed to failures mentioned in the previous section.

A more resilient and cost-effective strategy often involves accepting duplicates at the application layer and resolving them deterministically at the data storage layer. Shifting the idempotency boundary downstream from the application to the final storage layer is a strategic architectural choice. It ensures total data integrity without exhausting your system's complexity budget.

Idempotent, Event-driven architecture diagram with Cloud Run, Pub/Sub and Bigquery

An example is in the first article of this series, where I introduce a distributed invoice processor built with Gemini. That architecture acknowledges the possibility of duplicate messages being processed but enforces idempotency after the data lands in BigQuery (leveraging e.g. MERGE statements).

This architectural decision is the result of evaluating tradeoffs and weaknesses of other solutions:

enforcing idempotency at the application layer, makes the service stateful: storage is needed (e.g. a KV store or even a database) to retain memory of the processed events. This complicates the design introducing more failure modes (network partitions, timeouts etc.), increasing the maintenance burden and increasing the attack surface.
Pub/Sub has desirable qualities that make it attractive for an event-driven workflow,
- it can be configured to route undelivered messages to a DLQ automatically.
- It has a retry mechanism built in.
- It absorbs and moderates traffic spikes, slowing down if it starts receiving 429 codes (too many requests) from Cloud Run.
Deduplication in BigQuery is simpler from a technological and managerial perspective.

Here is a quick comparison table that summarizes tradeoffs between implementing deduplication with a stateful application vs. a stateless one:

	Application-Layer Deduplication (e.g., Redis / KV Store)	Database-Layer Deduplication (e.g., BigQuery downstream)
Service State Profile	Stateful: Requires an external cache or database to maintain a transactional memory of processed event IDs.	Stateless: The compute worker evaluates every event independently with zero external memory dependencies.
System Complexity	High: Introduces new failure modes including network partitions, storage connection timeouts, and race conditions.	Low: Eliminates extra infrastructure, reducing the maintenance burden and minimizing the application's attack surface.
Pub/Sub Native Synergy	Restricted: Suppresses native at-least-once delivery retry mechanics and complicates automatic Dead-Letter Queue (DLQ) routing.	Maximized: Leverages native Pub/Sub traffic-spike buffering, built-in retry logic, and automatic backpressure scaling flawlessly.
Implementation Effort	Complex: Requires writing distributed lock management, cache eviction policies, and handling transient database errors.	Simple: Handled natively via basic downstream SQL analytical window queries or standard dbt transformation models.
Operational Impact	Adds synchronous network latency to every event check at the ingress gateway layer.	Accepts temporary duplicate records in raw landing tables to preserve low cost/complexity, resilience and throughput speed.

Designing a Stateless Cloud Run Consumer for AI Workloads

Mitigating LLM Latency via Asynchronous Ingress Buffering

In a distributed architecture, coupling message ingestion directly to downstream API dependencies introduces severe systemic risks. If a Cloud Run worker pulls a message from Pub/Sub and immediately routes it into a blocking long-term operation LLM call, the ingestion tier becomes hostage to the runtime characteristics of a separate, distinct domain.

To prevent this thread exhaustion, the consumer separates message ingestion from semantic extraction:

the Cloud Run HTTP router accepts the incoming Pub/Sub push webhook payload and immediately performs a shallow structural validation check (e.g., confirming the payload contains a valid storage bucket URI).
The worker delegates the actual heavy compute task (invoking the Gemini Vertex AI endpoint and evaluating Pydantic type constraints) to asynchronous calls.
The Cloud Run worker does not acknowledge the Pub/Sub message until downstream processing is fully complete^[1]. The HTTP route awaits the asynchronous function calls, such as LLM API calls, before returning an HTTP status code. Only after the execution successfully concludes does the route return a 202 Accepted response, signaling a successful acknowledgment (ack) to the Pub/Sub broker.

It is an anti-pattern to acknowledge the message early by returning a response immediately before processing. Doing so forces Pub/Sub to instantly delete the message from the subscription queue^[2]. If the background process subsequently crashes or times out, the message cannot be recovered, leading to silent data loss. By holding the connection open and intentionally returning a 5xx Server Error (or allowing the request to time out) when an execution fails, the system safely exploits Pub/Sub's native, automatic redelivery mechanics and backoff policies.

Implementing a Key Design Feature: Event Tracking

Because the compute tier does not maintain an internal list of processed invoices, it relies on deterministic data tags passed through the event payload to maintain tracking integrity across the pipeline. Every invoice document is tracked using two distinct identity parameters:

message_id (Transport Identity): This is a unique string auto-generated by Google Cloud Pub/Sub when an event lands in the topic. It is used exclusively to track the transport lifecycle of the message envelope itself. It is critical for debugging infrastructure retries and monitoring delivery lag. The official Google documentation shows an example of the message Pub/Sub delivers on push subscriptions. In Python it can be extracted with Pydantic with the classes below:
```
class PubSubMessage(BaseModel):
    attributes: dict | None = None
    data: str  # base64 encoded
    messageId: str
    publishTime: str
 
 
class PubSubEnvelope(BaseModel):
    message: PubSubMessage
    subscription: str
```
where PubSubEnvelope is the message "envelope" as it is received by your endpoint. The real message inside the envelope (the "event") is base-64 encoded, see the field data above.
event_id (Domain Identity): This is a deterministic string hash or UUID embedded inside the data payload root. It must be stable across retries and globally unique. For example, if you're processing files, consider using its md5 as event_id; it is bound to the file's contents and downstream processes can identify and deduplicate re-delivered events by matching this identifier.

Below is an example of how to extract the incoming distributed trace context from a Cloud Run Pub/Sub push subscription invocation. By leveraging OpenTelemetry, we extract the parent trace context from the message attributes and explicitly inject the unique Pub/Sub message_id as a span attribute. This links the downstream application's execution span directly back to its triggering infrastructure event.

def trace_pubsub_event(span_name: str):
    def decorator(f: Callable):
        """Tracing Decorator"""
 
        @wraps(f)
        async def wrapper(envelope: PubSubEnvelope):
            """Executes the decorated function within a span"""
            tracer = trace.get_tracer(__name__)
 
            attributes = (
                envelope.message.attributes if envelope.message else {}
            )
            context = extract(attributes)
            with tracer.start_as_current_span(span_name, context) as span:
                if (
                    envelope
                    and envelope.message
                    and envelope.message.messageId
                ):
                    span.set_attribute("messaging.system", "gcp_pubsub")
                    span.set_attribute(
                        "messaging.source.name", envelope.subscription
                    )
                    span.set_attribute(
                        "messaging.message.id", envelope.message.messageId
                    )
 
                try:
                    return await f(envelope)
 
                except Exception as e:
                    span.record_exception(e)
                    span.set_status(trace.Status(trace.StatusCode.ERROR))
                    raise
 
        return wrapper
 
    return decorator
 
 
@router.post("/", status_code=status.HTTP_202_ACCEPTED)
@trace_pubsub_event("cloud-function")
async def process_event(envelope: PubSubEnvelope):
    ...

By attaching the same event_id to identical messages, we give the possibility to the downstream data warehouse to deduplicate payloads with the same content.

Notice that OpenTelemetry defines standardized messaging semantic conventions so that Kafka, Pub/Sub, RabbitMQ, SQS, NATS, etc. can be queried consistently across observability backends. In the example above I show how to add standard attributes to the parnent span of the application.

[1] Cloud Run has a maximum request timeout of 60 minutes, and Pub/Sub push subscriptions have a maximum acknowledgment deadline (up to 600 seconds or 10 minutes). If an LLM call takes too long or errors silently, Pub/Sub will redeliver the message while the first instance is still running, creating concurrent duplicate processing. Make sure you configure the timeout of Cloud Run to be shorter than Pub/Sub Deadline.

[2] Even if the subscription is configured to retain acknowledged messages those are not re-delivered

The Event-Driven AI Pipeline Series

Part 1: Building an Event-Driven Invoice Processing Pipeline with Gemini and Vertex AI (You are here)
Part 2: Engineering Idempotency and Handling Duplicate Events in Distributed AI Pipelines (You are here)
Part 3: Serverless Compute: Cloud Run and Pub/Sub Workflow Configuration and Hardening

❯Idempotent Event Processing in Cloud Run and Pub/Sub: Handling Duplicate Messages

Table of Contents