Architecture

Overview

The LinkedIn Interactions Pipeline is a REST API service that orchestrates the scraping of LinkedIn interactions using Apify and stores the results in BigQuery.

System Flow

1. User POSTs to /apify/linkedin/interactions with LinkedIn targets
   ↓
2. API triggers Apify scraping jobs (one per target) with webhook URL
   ↓
3. Apify scrapes LinkedIn data (5-20 minutes)
   ↓
4. Apify sends completion event to webhook endpoint
   ↓
5. Webhook handler processes data and saves to BigQuery tables
   ↓
6. Data available in BigQuery: ApifyLinkedinPost, ApifyLinkedinPostComment, ApifyLinkedinPostReaction

Note: A separate polling job runs hourly as a fallback to catch any missed webhook events.
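
The request that kicks off step 1 can be sketched as follows. The field names (`targets`, `handle`, `post_limit`) are illustrative assumptions, not the service's confirmed request schema:

```python
import json

# Hypothetical body for POST /apify/linkedin/interactions.
# Field names ("targets", "handle", "post_limit") are assumptions
# for illustration, not the service's documented schema.
request_body = {
    "targets": [
        {"handle": "example-company", "post_limit": 50},
        {"handle": "another-company", "post_limit": 50},
    ]
}

# Serialized as the JSON payload of the POST request.
payload = json.dumps(request_body)
```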

Components

API Service

  • Technology: FastAPI (Python)
  • Deployment: Google Cloud Run
  • Responsibilities:
      • Validate incoming requests
      • Trigger Apify scraping jobs
      • Handle webhook callbacks from Apify
      • Update tracker tables
      • Write scraped data to BigQuery
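
A minimal sketch of the request-validation responsibility, as a plain helper function. The exact rules (handle format, required fields) are assumptions, not the service's actual implementation:

```python
import re

# Rough shape of a LinkedIn handle; the exact rule is an assumption.
HANDLE_RE = re.compile(r"^[A-Za-z0-9-]{3,100}$")

def validate_targets(targets: list[dict]) -> list[str]:
    """Return a list of validation errors (empty if the request is valid)."""
    errors = []
    if not targets:
        errors.append("at least one target is required")
    for i, target in enumerate(targets):
        handle = target.get("handle", "")
        if not HANDLE_RE.match(handle):
            errors.append(f"target {i}: invalid LinkedIn handle {handle!r}")
    return errors
```

In the real service this logic would typically live in a Pydantic model on the FastAPI route rather than a standalone function.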

Apify Integration

  • Service: Apify Actor for LinkedIn scraping
  • Input: LinkedIn handles and scraping parameters
  • Output: Posts, comments, and reactions data
  • Callback: Webhook to API service on completion
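
Apify's "Run actor" REST endpoint accepts ad-hoc webhooks as a base64-encoded JSON array in the `webhooks` query parameter; `ACTOR.RUN.SUCCEEDED` fires on completion. The actor input fields and callback URL below are placeholders:

```python
import base64
import json

# Placeholder actor input; the real input fields depend on the
# specific Apify actor used for LinkedIn scraping.
run_input = {
    "profileUrls": ["https://www.linkedin.com/company/example-company"],
    "maxPosts": 50,
}

# Ad-hoc webhook definition: notify the API service when the run succeeds.
# The request URL is a placeholder for the service's webhook endpoint.
webhooks = [
    {
        "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
        "requestUrl": "https://api.example.com/apify/webhook",
    }
]

# Apify expects the webhooks array base64-encoded in the query string.
webhooks_param = base64.b64encode(json.dumps(webhooks).encode()).decode()
```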

Data Storage

  • Platform: Google BigQuery
  • Tables:
      • Tracker Tables: Track job status and progress
      • Data Tables: Store scraped LinkedIn content
  • Access: SQL queries for status checks and data retrieval
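
A status check against the tracker table might look like the parameterized query below. The column names (`job_id`, `status`, `updated_at`) are assumptions about the tracker schema, and the table reference is left unqualified (a real query would use the full `project.dataset.table` path):

```python
# Tracker table name from the production environment; column names
# below are assumptions about the schema, for illustration only.
TRACKER_TABLE = "ApifyInteractionsTrackerTable"

STATUS_QUERY = f"""
SELECT job_id, status, updated_at
FROM `{TRACKER_TABLE}`
WHERE job_id = @job_id
"""
```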

Fallback Polling Job

  • Schedule: Runs hourly
  • Purpose: Catch any webhook events that were missed
  • Function: Queries Apify for completed runs and processes them

Environments

The system is deployed in three separate environments, each with isolated resources:

| Environment | API Service                         | Polling Job                            | Tracker Table                        | Purpose                   |
|-------------|-------------------------------------|----------------------------------------|--------------------------------------|---------------------------|
| Dev         | brand-interactions-pipeline-dev     | brand-interactions-polling-job-dev     | ApifyInteractionsTrackerTableDev     | Development & testing     |
| Staging     | brand-interactions-pipeline-staging | brand-interactions-polling-job-staging | ApifyInteractionsTrackerTableStaging | Pre-production validation |
| Production  | brand-interactions-pipeline         | brand-interactions-polling-job         | ApifyInteractionsTrackerTable        | Live production use       |
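
The per-environment resource names above can be expressed as a lookup that code selects from at startup; the dictionary keys are assumptions about how the configuration is keyed, but the resource names come from the table:

```python
# Resource names per environment, taken from the table above.
ENVIRONMENTS = {
    "dev": {
        "api_service": "brand-interactions-pipeline-dev",
        "polling_job": "brand-interactions-polling-job-dev",
        "tracker_table": "ApifyInteractionsTrackerTableDev",
    },
    "staging": {
        "api_service": "brand-interactions-pipeline-staging",
        "polling_job": "brand-interactions-polling-job-staging",
        "tracker_table": "ApifyInteractionsTrackerTableStaging",
    },
    "production": {
        "api_service": "brand-interactions-pipeline",
        "polling_job": "brand-interactions-polling-job",
        "tracker_table": "ApifyInteractionsTrackerTable",
    },
}

def tracker_table(env: str) -> str:
    """Resolve the tracker table for an environment."""
    return ENVIRONMENTS[env]["tracker_table"]
```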

Data Flow Details

Request Processing

  1. Client sends POST request with LinkedIn targets
  2. API validates request parameters
  3. For each target, API creates an Apify job with:
      • LinkedIn handle
      • Scraping parameters (limits, filters)
      • Webhook URL for completion notification
  4. Tracker table entry created with PENDING status
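
The tracker entry created in the final step might look like this; the field names are assumptions about the tracker-table schema:

```python
from datetime import datetime, timezone

def new_tracker_entry(handle: str, apify_run_id: str) -> dict:
    """Build the row inserted into the tracker table when a job starts.

    Field names are illustrative assumptions, not the actual schema.
    """
    return {
        "handle": handle,
        "apify_run_id": apify_run_id,
        "status": "PENDING",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```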

Scraping Phase

  1. Apify actor starts scraping LinkedIn
  2. Collects posts, reactions, and comments based on parameters
  3. Typical duration: 5-20 minutes depending on data volume

Completion Phase

  1. Apify sends webhook to API service
  2. Webhook handler:
      • Retrieves scraped data from Apify
      • Writes posts to ApifyLinkedinPost table
      • Writes comments to ApifyLinkedinPostComment table
      • Writes reactions to ApifyLinkedinPostReaction table
      • Updates tracker table with COMPLETED status and counts
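
The fan-out of scraped items into the three data tables can be sketched as a routing step. The `type` discriminator field is an assumption about the shape of the Apify dataset items:

```python
# Destination table per item type; the "type" field on scraped items
# is an assumption about the Apify dataset format.
TABLE_BY_TYPE = {
    "post": "ApifyLinkedinPost",
    "comment": "ApifyLinkedinPostComment",
    "reaction": "ApifyLinkedinPostReaction",
}

def route_items(items: list[dict]) -> dict[str, list[dict]]:
    """Group scraped items by the BigQuery table they belong in."""
    grouped: dict[str, list[dict]] = {t: [] for t in TABLE_BY_TYPE.values()}
    for item in items:
        table = TABLE_BY_TYPE.get(item.get("type", ""))
        if table is not None:  # silently skip unrecognized item types
            grouped[table].append(item)
    return grouped
```

Each group would then be written to its table in a single batched insert before the tracker row is marked COMPLETED with the per-table counts.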

Fallback Mechanism

  1. Polling job runs every hour
  2. Queries tracker table for PENDING/RUNNING jobs
  3. Checks Apify for completion status
  4. Processes any completed jobs that weren't caught by webhooks
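
The selection logic in steps 2-4 can be sketched as a pure function; `apify_status` stands in for a lookup against Apify's run-status API, and the row field names are assumptions about the tracker schema:

```python
def jobs_to_process(tracker_rows: list[dict],
                    apify_status: dict[str, str]) -> list[dict]:
    """Return PENDING/RUNNING tracker rows whose Apify run already succeeded.

    `apify_status` maps Apify run IDs to their current status and stands
    in for a real call to Apify's run-status API.
    """
    return [
        row
        for row in tracker_rows
        if row["status"] in ("PENDING", "RUNNING")
        and apify_status.get(row["apify_run_id"]) == "SUCCEEDED"
    ]
```

Rows returned here are handed to the same processing path the webhook handler uses, which keeps the fallback idempotent with the primary flow.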