How to Sync HubSpot Company Records to S3 with AWS Lambda and Step Functions

If you need to pull HubSpot company records into S3 on a schedule, a single Lambda function can hit the 15-minute timeout when dealing with large datasets. This guide builds a Python Lambda that syncs HubSpot company records to S3 in batches, with AWS Step Functions handling continuation when time runs short.

Prerequisites

  • An AWS account with permissions to create Lambda functions, Step Functions, and S3 buckets
  • AWS CLI installed and configured — see How to Install AWS CLI v2 on Ubuntu 22.04
  • A HubSpot account with a private app token that has the crm.objects.companies.read scope
  • Python 3.12 or later
  • An S3 bucket to store the exported records

How It Works

  1. Lambda queries the HubSpot CRM search API for recently modified company records
  2. Each batch of results gets saved as a JSON file in S3
  3. If Lambda approaches the 15-minute timeout, it returns incomplete with a pagination cursor
  4. Step Functions checks the response and re-invokes Lambda to continue where it left off
  5. When all records are processed, Lambda returns complete and the workflow ends

This pattern lets you process any number of records without worrying about Lambda’s execution limit.
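Concretely, the handoff is a small JSON contract between the function and the state machine. A minimal sketch of the two response shapes (the field names match the handler and state machine later in this guide; the cursor value is made up):

```python
# The two shapes the Lambda handler can return (cursor value is illustrative).
incomplete = {"status": "incomplete", "offset": "100"}  # "100" stands in for a HubSpot paging cursor
complete = {"status": "complete"}


def should_continue(response):
    # The Step Functions Choice state applies this same test on $.status
    return response.get("status") == "incomplete"


print(should_continue(incomplete))  # True
print(should_continue(complete))    # False
```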

Project Structure

The Lambda project has three files. For tips on organizing larger projects, see How to Structure Your Python Projects for AWS Lambda, APIs, and CLI Tools.

lambda_hubspot_sync/
├── handler.py         # Lambda entry point
├── utils.py           # HubSpot API and S3 helpers
└── requirements.txt   # Python dependencies

Create the Lambda Handler

The handler runs a loop that fetches HubSpot records in batches and saves each batch to S3. It checks elapsed time after each batch and hands off to Step Functions if it’s approaching the timeout.

handler.py

import os
import time
from datetime import datetime, timedelta, timezone
from utils import get_company_records, write_to_s3

BATCH_SIZE = 100
TIME_LIMIT = 840  # 14 minutes — buffer before Lambda's 15-min limit
BUCKET_NAME = os.environ["BUCKET_NAME"]
OFFSET_MINUTES = int(os.environ["OFFSET_MINUTES"])


def lambda_handler(event, context):
    start_time = time.time()
    offset = event.get("offset", 0)

    # Calculate cutoff timestamp (Unix ms) for each invocation
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=OFFSET_MINUTES)
    last_modified_ms = str(int(cutoff.timestamp() * 1000))

    while True:
        records, next_offset = get_company_records(
            last_modified_ms, BATCH_SIZE, offset
        )

        if not records:
            break

        filename = f"hubspot_companies_batch_{offset}.json"
        write_to_s3(records, BUCKET_NAME, filename)

        if not next_offset:
            break

        offset = next_offset

        if time.time() - start_time > TIME_LIMIT:
            return {"status": "incomplete", "offset": offset}

    return {"status": "complete"}

  • TIME_LIMIT = 840 — stops processing at 14 minutes, leaving a 60-second buffer before Lambda’s hard 900-second cutoff. If you configure a shorter function timeout, check context.get_remaining_time_in_millis() instead of a hardcoded limit
  • OFFSET_MINUTES — controls how far back to look for modified records (e.g., 30 means “modified in the last 30 minutes”)
  • The offset variable holds HubSpot’s pagination cursor, passed between invocations through Step Functions
  • Filenames are keyed by cursor position (hubspot_companies_batch_0.json, and so on), so the next scheduled run overwrites them — add a date-based prefix to the key if you need to keep historical exports
  • Returns incomplete with the cursor when time is running out, or complete when all records are processed
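The cutoff calculation is easy to sanity-check in isolation. Here is the same conversion as a standalone sketch (the cutoff_ms helper name is mine, not part of the handler):

```python
from datetime import datetime, timedelta, timezone


def cutoff_ms(offset_minutes, now=None):
    # HubSpot's search API wants Unix milliseconds, passed as a string
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=offset_minutes)
    return str(int(cutoff.timestamp() * 1000))


# Pin "now" to a fixed instant so the result is reproducible
fixed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(cutoff_ms(30, now=fixed))  # 1704108600000 — 11:30 UTC on 2024-01-01, in ms
```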

Create the Utility Module

The utility module handles the HubSpot API request and S3 upload. For production workloads, add retry logic to the API call — see How to Make Reliable HubSpot API Requests in Python (With Retry Logic).

utils.py

import os
import json
import boto3
import requests

HUBSPOT_API_KEY = os.environ["HUBSPOT_API_KEY"]
S3 = boto3.client("s3")

SEARCH_URL = "https://api.hubapi.com/crm/v3/objects/companies/search"


def get_company_records(last_modified_ms, limit, after):
    headers = {
        "Authorization": f"Bearer {HUBSPOT_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "filterGroups": [
            {
                "filters": [
                    {
                        "propertyName": "lastmodifieddate",
                        "operator": "GTE",
                        "value": last_modified_ms,
                    }
                ]
            }
        ],
        "properties": ["name", "domain"],
        "limit": limit,
    }
    if after:
        payload["after"] = str(after)

    resp = requests.post(SEARCH_URL, headers=headers, json=payload)
    resp.raise_for_status()
    data = resp.json()

    records = data.get("results", [])
    next_after = data.get("paging", {}).get("next", {}).get("after")

    return records, next_after


def write_to_s3(data, bucket, key):
    S3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(data, indent=2).encode("utf-8"),
    )

  • The HubSpot search API expects datetime filter values as Unix timestamps in milliseconds, passed as a string
  • after is only included in the payload when it has a value — the first request starts from the beginning without it
  • resp.raise_for_status() raises an exception on 4xx/5xx responses so failures surface immediately
  • The properties array controls which company fields come back — add any HubSpot properties you need
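The cursor extraction is worth verifying against a canned response. A small sketch with made-up payloads in the shape the search API returns — a mid-results page carries paging.next.after, the last page omits it:

```python
def extract_page(data):
    # Mirrors the parsing at the end of get_company_records
    records = data.get("results", [])
    next_after = data.get("paging", {}).get("next", {}).get("after")
    return records, next_after


mid_page = {
    "results": [{"id": "1", "properties": {"name": "Acme"}}],
    "paging": {"next": {"after": "100"}},
}
last_page = {"results": [{"id": "2", "properties": {"name": "Globex"}}]}

print(extract_page(mid_page)[1])   # 100
print(extract_page(last_page)[1])  # None
```

The chained .get() calls with dict defaults mean a missing "paging" key degrades cleanly to None instead of raising KeyError, which is what ends the loop in the handler.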

Add Dependencies

requirements.txt

boto3
requests

boto3 is included in the Lambda runtime by default, but listing it keeps local development consistent. Package requests with your deployment zip or as a Lambda layer — see How to Build and Deploy Python Libraries for AWS Lambda Layers for a step-by-step guide.

Configure Environment Variables

Set these in your Lambda function configuration:

Variable          Description                                          Example
BUCKET_NAME       S3 bucket for storing exported records               my-hubspot-exports
HUBSPOT_API_KEY   HubSpot private app token                            pat-na1-xxxxxxxx
OFFSET_MINUTES    How many minutes back to look for modified records   30

For the HUBSPOT_API_KEY, create a private app in your HubSpot account under Settings > Integrations > Private Apps, and grant it the crm.objects.companies.read scope.

Set Up the Step Functions State Machine

The state machine loops the Lambda function until all records are processed. It initializes with a default offset of 0, invokes Lambda, checks the returned status, and either loops back or finishes.

{
  "Comment": "Batch-process HubSpot company records across Lambda invocations",
  "StartAt": "Set Default Input",
  "States": {
    "Set Default Input": {
      "Type": "Pass",
      "Result": {
        "offset": 0
      },
      "Next": "Sync HubSpot Batch"
    },
    "Sync HubSpot Batch": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:hubspot-sync",
      "Parameters": {
        "offset.$": "$.offset"
      },
      "ResultPath": "$",
      "Next": "Check Status"
    },
    "Check Status": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "incomplete",
          "Next": "Sync HubSpot Batch"
        }
      ],
      "Default": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}

Replace the Lambda ARN with your actual function ARN. Here’s what each state does:

  • Set Default Input — initializes the offset to 0 so the first Lambda call starts from the beginning
  • Sync HubSpot Batch — invokes Lambda with the current offset. "ResultPath": "$" replaces the state’s entire input with Lambda’s output, so the returned status and offset become the input to the next state
  • Check Status — if Lambda returned incomplete, loops back with the new offset. Otherwise, moves to Done
  • Done — ends the execution

To create the state machine, go to the AWS Step Functions console, choose “Create state machine”, select “Write your workflow in code”, and paste the JSON above.
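If you prefer scripting over the console, the same definition can be created with boto3. A sketch under assumptions — the function name, definition file path, and role ARN below are placeholders:

```python
import json


def load_definition(path):
    # Read the state machine JSON and re-serialize it, failing fast on invalid JSON
    with open(path) as f:
        return json.dumps(json.load(f))


def create_state_machine(name, role_arn, definition_path):
    # boto3 is imported here so load_definition stays usable without AWS credentials
    import boto3

    sfn = boto3.client("stepfunctions")
    return sfn.create_state_machine(
        name=name,
        definition=load_definition(definition_path),
        roleArn=role_arn,
    )
```

Running create_state_machine("hubspot-sync", "arn:aws:iam::123456789012:role/StepFunctionsRole", "state_machine.json") once returns a response whose stateMachineArn is what your EventBridge rule will target.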

Add IAM Permissions

Step Functions Execution Role

The Step Functions role needs permission to invoke your Lambda function:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:hubspot-sync"
    }
  ]
}

Lambda Execution Role

The Lambda role needs S3 write access and CloudWatch Logs for debugging:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-hubspot-exports/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/hubspot-sync:*"
    }
  ]
}

Replace the S3 bucket name, region, account ID, and function name with your own values.

Schedule with EventBridge

To run the sync automatically, create an EventBridge rule that triggers the Step Functions state machine on a schedule. The rule needs an IAM role with states:StartExecution permission on the state machine. For example, to run every 30 minutes:

cron(0/30 * * * ? *)

Match the schedule interval to your OFFSET_MINUTES value so runs don’t miss records. If you schedule every 30 minutes, set OFFSET_MINUTES to 30 or slightly higher — a few minutes of overlap means some records are exported twice, which is safer than missing records modified while a run was in flight.

Conclusion

You now have a Lambda function that syncs HubSpot company records to S3 in batches, with Step Functions handling continuation across invocations. This pattern keeps you within Lambda’s timeout limit regardless of how many records need processing.

For a deeper look at the timeout pattern itself, including strategies for handling mid-page timeouts, read How to Avoid AWS Lambda Timeout When Processing HubSpot Records.