Building an Event-Driven Invoice Processing Pipeline with Gemini and Vertex AI
Exploring a scalable document extraction pipeline using Gemini on Google Cloud.
This article explains how to build a scalable invoice processing pipeline using Gemini on Vertex AI, Cloud Run, Pub/Sub, and BigQuery. The architecture uses event-driven design principles to process invoices asynchronously while combining AI-powered extraction with deterministic validation. We'll cover the architecture, tradeoffs, and production considerations. Implementation details will be covered in follow-up articles.
Traditional OCR pipelines (built on tools like Tesseract or Document AI) often rely on coordinate-based template matching or bounding-box heuristic parsers to locate fields. If an invoice layout shifts by as little as 10 pixels, template matching can break. Gemini instead extracts fields by meaning: it can find "Total Balance" regardless of where the label sits on the page.
| Aspect | Traditional Template-Based OCR | Multimodal LLM Semantic Parsing (Gemini) |
|---|---|---|
| Primary Extraction Mechanism | Spatial coordinate mapping and bounding-box heuristic extraction | Joint visual and textual token processing (multimodal attention) |
| Layout Adaptability | Brittle; minor structural or formatting shifts break regex or positional rules | Resilient; adapts dynamically to layout variations via semantic intent |
| Data Validation Layer | Requires downstream processing with regex and type checks | Structure enforced at generation time via schema-constrained output (e.g. a Pydantic model); business rules still need deterministic checks |
| Handling Low-Resolution Artifacts | High failure rate; bad scans result in garbled text tokens | High contextual recovery; infers correct values based on line-item math and context |
| Compute & Latency Profile | Fast execution (<1s), deterministic pricing, low resource footprint | Variable latency (2–5s), token-based pricing, requires managed infrastructure |
In this article we'll explore an event-driven invoice processing pipeline using Gemini on Vertex AI. Rather than relying on document-specific templates, the pipeline treats invoice extraction as a document-understanding problem and combines AI-based extraction with deterministic validation.
This article gives a high-level overview of the architecture, the tradeoffs behind its design, and some of the limitations that became apparent during implementation. I defer a more in-depth discussion of the details to other articles.
The Architecture Constraints
Before discussing the architecture, it's worth explaining the constraints that influenced it:
- Unpredictable Payload Sizes: Inbound PDF invoice uploads vary in file size and page count.
- Variable Inference Latency: Large language model (LLM) API response times fluctuate with document complexity and service availability.
- Schema Volatility: Invoice formats evolve continually across different vendors.
- At-Least-Once Delivery Guarantees: Processing an event twice is preferable to dropping a transaction.
- Stateless Operations: Keeping the compute tier simple, isolated, and easy to scale.
These constraints pushed the design toward an asynchronous, event-driven architecture.
Architecture Overview
This design uses an asynchronous ingestion pattern to avoid the resource constraints and failure modes associated with synchronous, tightly coupled systems.
Google Cloud Pub/Sub buffers inbound traffic spikes, while Cloud Run workers manage processing throughput via built-in concurrency and backpressure controls. This decouples the ingestion layer from the downstream extraction layer, preventing system overloads during peak business hours.
A simplified workflow is below:
The system operates via a decoupled pipeline where each component has an isolated responsibility:
- Storage Ingestion: Invoices are uploaded directly to a secure Cloud Storage bucket.
- Event Notification: Cloud Storage triggers an object-created event, publishing a message to a Google Cloud Pub/Sub ingestion topic.
- Compute Worker: A stateless Cloud Run container consumes messages from the subscription queue.
- AI Extraction: The worker passes the document content to the Gemini API on Vertex AI using structured schemas.
- Analytics Persistence: The extracted structured JSON data is streamed directly into Google BigQuery.
By decoupling these layers, every component scales independently. If invoices arrive faster than the Gemini API can process them, messages safely accumulate in the Pub/Sub subscription rather than failing the client upload.
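To make the hand-off between Cloud Storage, Pub/Sub, and the worker concrete, here is a minimal sketch of how a worker might decode an incoming notification. It assumes a push subscription and the JSON payload format for Cloud Storage notifications; the article does not say which delivery mode is used. `InvoiceEvent` and `parse_push_envelope` are hypothetical names introduced only for illustration.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceEvent:
    """Minimal view of a Cloud Storage notification (hypothetical helper type)."""
    message_id: str
    bucket: str
    name: str
    generation: str


def parse_push_envelope(body: bytes) -> InvoiceEvent:
    """Decode a Pub/Sub push request body carrying a Cloud Storage notification."""
    envelope = json.loads(body)
    message = envelope["message"]
    attributes = message.get("attributes", {})

    # Only newly created (finalized) objects should trigger extraction.
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        raise ValueError(f"unexpected event type: {attributes.get('eventType')}")

    # The message data is the base64-encoded JSON description of the object.
    payload = json.loads(base64.b64decode(message["data"]))
    return InvoiceEvent(
        message_id=message["messageId"],
        bucket=payload["bucket"],
        name=payload["name"],
        generation=str(payload["generation"]),
    )
```

The worker only needs the bucket, object name, and generation to fetch the PDF, which keeps the message small and the worker stateless.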
Note: Dead-Letter Queues (DLQs) are configured both before and after the Cloud Run worker stage to capture malformed PDFs or persistent upstream API errors for manual triage.
Using Gemini on Vertex AI for Document Understanding
One thing worth mentioning is that this pipeline doesn't use a traditional OCR stage.
Instead, the PDF is converted to Markdown and then passed to Gemini. LLMs see large amounts of Markdown during training, and they tend to interpret structures such as rows and tables well when these are expressed in Markdown.
The two main steps of the workflow look like this:
```python
# Convert the document to Markdown
document = processor.convert_one(file_handle)

# Call Gemini on Vertex AI
response = await gemini.generate(document.markdown)
```

The model receives both the document and a predefined output schema describing the invoice fields I want returned.
In another article, I explain in more detail how to configure the Gemini client with enterprise Vertex AI and structured output.
The target extraction fields are strictly defined by passing a predefined output schema configuration directly to the Gemini client interface. This shifts the problem away from template-based extraction and toward output consistency.
Architectural Deep Dive: Design Decisions & Tradeoffs
Storing Raw Gemini JSON Output instead of Strict Relational Schemas
The pipeline stores raw structured output in a JSON column rather than in a predefined relational schema.
The reason is flexibility. Invoice formats vary, and different vendors expose different fields. Enforcing a rigid schema at ingestion time would tightly couple the extraction layer to downstream analytics.
By storing JSON directly, the system lets downstream consumers decide how to interpret and normalize the data.
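As a rough illustration of this idea, the sketch below wraps a model response in a small, stable envelope and keeps the extraction itself as an opaque JSON string. The column names (`event_id`, `source_uri`, `ingested_at`, `payload`) and the function `build_raw_row` are assumptions made for the example, not the pipeline's actual table layout. How a JSON column value must be encoded also depends on the BigQuery client and write API in use.

```python
import json
from datetime import datetime, timezone


def build_raw_row(event_id: str, source_uri: str, extraction: dict) -> dict:
    """Wrap a model response in a small, stable envelope for a raw table."""
    return {
        "event_id": event_id,
        "source_uri": source_uri,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # Serialized as-is: vendor-specific fields survive without a schema change.
        "payload": json.dumps(extraction, default=str, sort_keys=True),
    }
```

Two vendors can then return different fields (say, a purchase order number for one and a payment reference for another) and both rows land in the same table without a schema migration.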
Idempotency: Managing Pub/Sub Duplicates Without a Cache Tier
Google Cloud Pub/Sub guarantees at-least-once message delivery. Consequently, transient network retries or worker container restarts will occasionally cause duplicate event processing.
Instead of implementing a distributed caching layer (such as Redis) to manage state deduplication inside the Cloud Run compute tier, the worker remains completely stateless. Every event is processed independently, and duplicate resolution is handled downstream in the BigQuery layer using unique event_ids.
This approach has a few advantages:
- The application layer remains simple to operate.
- No additional cache or database is required.
- Deduplication logic can evolve without redeploying the processing service.
- Analysts can inspect duplicate records rather than having them silently discarded.
The tradeoff is that duplicate records may temporarily exist in the raw dataset until downstream processes consolidate them.
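The consolidation step can be sketched in plain Python. This is only an in-memory analogue of what the warehouse would do; in BigQuery the same idea is usually expressed as a window function partitioned by `event_id`. The article does not say how `event_id` is generated, so `make_event_id` below is a hypothetical scheme that derives it from the object's bucket, name, and generation, which makes a redelivered notification for the same upload map to the same id.

```python
import hashlib
from typing import Dict, Iterable, List


def make_event_id(bucket: str, name: str, generation: str) -> str:
    """Derive a deterministic id so redeliveries of one upload collide on purpose."""
    key = f"{bucket}/{name}#{generation}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def latest_per_event(rows: Iterable[dict]) -> List[dict]:
    """Keep one row per event_id, preferring the most recent ingested_at.

    Assumes ingested_at values are ISO 8601 strings in the same timezone and
    format, so that string comparison matches chronological order.
    """
    winners: Dict[str, dict] = {}
    for row in rows:
        current = winners.get(row["event_id"])
        if current is None or row["ingested_at"] > current["ingested_at"]:
            winners[row["event_id"]] = row
    return list(winners.values())
```

Because the raw rows are never deleted by the worker, the rule for choosing a winner (latest, earliest, or highest-confidence) can change later without touching the processing service.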
Enforcing Deterministic Business Validation
A structurally valid JSON response from an LLM does not guarantee factual correctness. To solve this, the pipeline decouples structural extraction from business validation logic.
The pipeline uses a simple two-stage validation approach.
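Before looking at each tier, the overall flow can be sketched with the standard library alone. This is a simplified stand-in rather than the pipeline's code: the real implementation uses Pydantic for the structural tier, and `REQUIRED_FIELDS`, `validate_structure`, and `business_issues` are hypothetical names. The point it illustrates is that a response can pass the structural tier and still fail a business rule.

```python
import json
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ("subtotal", "total", "line_items")  # hypothetical subset


def validate_structure(raw: str) -> dict:
    """Tier 1 stand-in: the response must be a JSON object with the required fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response is not a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(data["line_items"], list):
        raise ValueError("line_items must be a list")
    return data


def business_issues(data: dict) -> list:
    """Tier 2: deterministic checks; returns the reasons for manual review."""
    try:
        subtotal = Decimal(str(data["subtotal"]))
        items = sum(
            (Decimal(str(item["subtotal"])) for item in data["line_items"]),
            Decimal("0"),
        )
    except (InvalidOperation, KeyError, TypeError):
        return ["amounts are not parseable"]
    issues = []
    # Allow one cent of rounding difference between line items and subtotal.
    if abs(items - subtotal) > Decimal("0.01"):
        issues.append("line items do not sum to subtotal")
    return issues
```

A response whose line items add up to 25.00 against a stated subtotal of 30.00 is valid JSON with every required field present, yet `business_issues` flags it for review.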
Tier 1: Structural Validation with Pydantic and Vertex AI
The system passes a specialized Pydantic schema object straight to the Gemini Vertex AI client using response_schema configurations. This constrains the underlying model to output a strict JSON structure matching the required data types.
```python
import uuid
from typing import List, Optional

from pydantic import BaseModel, Field


# LineItem is another Pydantic model, defined elsewhere.
class InvoiceHeader(BaseModel):
    invoice_id: uuid.UUID = Field(
        default_factory=uuid.uuid4,
        description="Unique invoice identifier",
    )
    sender_name: Optional[str] = Field(
        None, description="Name of the invoice sender or issuing company"
    )
    sender_billing_address: Optional[str] = Field(
        description="Billing address of the invoice sender or client"
    )
    sender_shipping_address: Optional[str] = Field(
        description="Shipping address of the invoice sender or client"
    )
    ...
    line_items: List[LineItem] = Field(
        default_factory=list,
        description="List of line items associated with the invoice",
    )
```

Tier 2: Business Validation
Once the structural JSON passes Pydantic verification, the application runs deterministic Python business logic to evaluate data consistency before writing to storage. If an invoice fails these checks, it is flagged in the database for manual review.
These checks do not replace a complete validation workflow, which is delegated to downstream analytical processing once the data has landed in BigQuery.
```python
from datetime import datetime, timezone
from decimal import Decimal


class InvoiceHeader(BaseModel):
    invoice_id: uuid.UUID = Field(
        default_factory=uuid.uuid4,
        description="Unique invoice identifier",
    )
    ...

    @property
    def is_invoice_date_valid(self) -> bool:
        """Returns False if the invoice date is missing or in the future."""
        if self.invoice_date:
            inv_date = self.invoice_date
            # Treat naive datetimes as UTC so the comparison is well-defined.
            if inv_date.tzinfo is None:
                inv_date = inv_date.replace(tzinfo=timezone.utc)
            return inv_date <= datetime.now(timezone.utc)
        return False

    @property
    def is_subtotal_valid(self) -> bool:
        """
        Returns True if the sum of the line items' subtotals matches the
        invoice subtotal (within one cent), otherwise False.
        """
        if self.subtotal is None:
            return False
        calculated_subtotal = Decimal("0")
        for item in self.line_items:
            if item.subtotal is None:
                return False
            calculated_subtotal += item.subtotal
        return abs(calculated_subtotal - self.subtotal) <= Decimal("0.01")

    @property
    def is_total_valid(self) -> bool:
        """
        Returns True if subtotal + tax - discount matches the invoice total
        (within one cent), otherwise False.
        """
        if self.subtotal is None or self.total is None:
            return False
        # Assumption: a missing tax or discount amount counts as zero.
        tax_total = self.tax_total or Decimal("0")
        discount_total = self.discount_total or Decimal("0")
        calculated_total = self.subtotal + tax_total - discount_total
        return abs(calculated_total - self.total) <= Decimal("0.01")
```

Production Design Considerations
Handling Late-Arriving Deliveries and Retries
Because the pipeline is event-driven, compute nodes remain decoupled from ingestion rates. If a downstream dependency is temporarily degraded and a message is not acknowledged before its acknowledgement deadline (ackDeadline), Pub/Sub automatically redelivers it to the subscription, so the event is retried rather than lost while the dependency recovers.
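One way a worker can cooperate with this retry behaviour is sketched below, assuming a push subscription in which the HTTP status code of the response acts as the acknowledgement. The exception classes and `handle_delivery` are hypothetical names for illustration. The specific error codes are arbitrary; what matters is that only a success status acknowledges the message.

```python
class TransientError(Exception):
    """Temporary failure (hypothetical), e.g. a rate limit or an upstream timeout."""


class PermanentError(Exception):
    """Failure that retrying cannot fix (hypothetical), e.g. a corrupt PDF."""


def handle_delivery(process, event) -> int:
    """Map the outcome of processing to an HTTP status for a push subscription."""
    try:
        process(event)
    except TransientError:
        # Non-success status: the message stays unacknowledged and is redelivered.
        return 503
    except PermanentError:
        # Also unacknowledged. If the subscription has a dead-letter policy, the
        # message is forwarded there once its delivery attempts are exhausted.
        return 422
    # Success status: the message is acknowledged and will not be redelivered.
    return 204
```

Keeping the two failure types distinct is mainly useful for logging and alerting, since both paths end with the message being retried and, eventually, dead-lettered.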
Cost Optimization for High-Volume Extraction
Converting documents to Markdown text before invoking the model reduces total token usage compared to sending multi-page images directly. For large batch operations, this token efficiency adds up to meaningful cost savings over thousands of invocations.
Markdown conversion is fast even for multi-page PDFs, so running it synchronously inside the worker adds little latency. Moreover, the API limits the size and number of pages of uploaded PDFs, a limit that sending Markdown text instead of the raw file sidesteps.
Security and Isolation of the Cloud Run Function
The Cloud Run Function is deployed with its own dedicated IAM role and restrictive permissions. Moreover, the ingress endpoint only accepts requests from within the Google network, isolating the instance from the public internet.
Vertex AI Compliance and Data Privacy (PII)
Because invoices frequently contain Personally Identifiable Information (PII) and sensitive corporate financial data, security architecture is paramount. This pipeline relies on enterprise-grade Vertex AI endpoints, which operate under strict data isolation and governance frameworks distinct from consumer AI services:
- No Model Training: customer data—including uploaded invoice PDFs, prompt metadata, and structured JSON outputs—is strictly isolated within your Google Cloud project boundary. Google does not use customer data submitted to Vertex AI to train or fine-tune foundation models by default under its enterprise terms.
- Ephemeral Processing: in data-isolated regions, API requests are processed ephemerally in memory. Limited logging and diagnostic retention may occur under Google Cloud operational and security controls, with enterprise options to restrict logging.
- Compliance Frameworks: the processing pipeline aligns with global enterprise standards, adhering directly to the Google Cloud Data Processing Addendum (DPA) and service-specific Vertex AI terms.
Before pushing to production, organizations should classify data according to internal data retention policies and map compliance requirements (such as GDPR or SOC 2) to their specific deployment region. Make sure you read the Google Cloud Data Processing Addendum (DPA) and Vertex AI service-specific terms.
Key Takeaways
Building production-ready AI pipelines demonstrates that large language models are most effective when confined to bounded, specialized roles within an existing, robust system architecture.
- Isolate probabilistic tasks: use LLMs strictly for semantic extraction, not for system routing or data validation.
- Keep compute tiers stateless: delegate buffering to Pub/Sub and data deduplication to modern data warehouses like BigQuery.
- Rely on classic engineering: queues, retry mechanisms, structural typing, and deterministic validation rules remain the foundational elements that keep AI architectures stable, reliable, and scalable.