Architecture

Overview

The LinkedIn Interactions Pipeline is a REST API service that orchestrates the scraping of LinkedIn interactions using Apify and stores the results in BigQuery.

System Flow

1. User POSTs to /apify/linkedin/interactions with LinkedIn targets
   ↓
2. API triggers Apify scraping jobs (one per target) with webhook URL
   ↓
3. Apify scrapes LinkedIn data (5-20 minutes)
   ↓
4. Apify sends completion event to webhook endpoint
   ↓
5. Webhook handler processes data and saves to BigQuery tables
   ↓
6. Data available in BigQuery: ApifyLinkedinPost, ApifyLinkedinPostComment, ApifyLinkedinPostReaction

Note: A separate polling job runs hourly as a fallback to catch any missed webhook events.
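
The request that kicks off step 1 can be sketched as follows. The field names (`targets`, `handle`, `post_limit`) are illustrative assumptions, not the service's confirmed request schema:

```python
import json

# Hypothetical body for POST /apify/linkedin/interactions.
# Field names ("targets", "handle", "post_limit") are assumptions
# for illustration, not the service's documented schema.
request_body = {
    "targets": [
        {"handle": "example-company", "post_limit": 50},
        {"handle": "another-company", "post_limit": 50},
    ]
}

# Serialized as the JSON payload of the POST request.
payload = json.dumps(request_body)
```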

Components

API Service

  • Technology: FastAPI (Python)
  • Deployment: Google Cloud Run
  • Responsibilities:
      • Validate incoming requests
      • Trigger Apify scraping jobs
      • Handle webhook callbacks from Apify
      • Update tracker tables
      • Write scraped data to BigQuery
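
A minimal sketch of the request-validation responsibility, as a plain helper function. The exact rules (handle format, required fields) are assumptions, not the service's actual implementation:

```python
import re

# Rough shape of a LinkedIn handle; the exact rule is an assumption.
HANDLE_RE = re.compile(r"^[A-Za-z0-9-]{3,100}$")

def validate_targets(targets: list[dict]) -> list[str]:
    """Return a list of validation errors (empty if the request is valid)."""
    errors = []
    if not targets:
        errors.append("at least one target is required")
    for i, target in enumerate(targets):
        handle = target.get("handle", "")
        if not HANDLE_RE.match(handle):
            errors.append(f"target {i}: invalid LinkedIn handle {handle!r}")
    return errors
```

In the real service this logic would typically live in a Pydantic model on the FastAPI route rather than a standalone function.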

Apify Integration

  • Service: Apify Actor for LinkedIn scraping
  • Input: LinkedIn handles and scraping parameters
  • Output: Posts, comments, and reactions data
  • Callback: Webhook to API service on completion
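
Apify's "Run actor" REST endpoint accepts ad-hoc webhooks as a base64-encoded JSON array in the `webhooks` query parameter; `ACTOR.RUN.SUCCEEDED` fires on completion. The actor input fields and callback URL below are placeholders:

```python
import base64
import json

# Placeholder actor input; the real input fields depend on the
# specific Apify actor used for LinkedIn scraping.
run_input = {
    "profileUrls": ["https://www.linkedin.com/company/example-company"],
    "maxPosts": 50,
}

# Ad-hoc webhook definition: notify the API service when the run succeeds.
# The request URL is a placeholder for the service's webhook endpoint.
webhooks = [
    {
        "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
        "requestUrl": "https://api.example.com/apify/webhook",
    }
]

# Apify expects the webhooks array base64-encoded in the query string.
webhooks_param = base64.b64encode(json.dumps(webhooks).encode()).decode()
```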

Data Storage

  • Platform: Google BigQuery
  • Tables:
      • Tracker Tables: Track job status and progress
      • Data Tables: Store scraped LinkedIn content
  • Access: SQL queries for status checks and data retrieval
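
A status check against the tracker table might look like the parameterized query below. The column names (`job_id`, `status`, `updated_at`) are assumptions about the tracker schema, and the table reference is left unqualified (a real query would use the full `project.dataset.table` path):

```python
# Tracker table name from the production environment; column names
# below are assumptions about the schema, for illustration only.
TRACKER_TABLE = "ApifyInteractionsTrackerTable"

STATUS_QUERY = f"""
SELECT job_id, status, updated_at
FROM `{TRACKER_TABLE}`
WHERE job_id = @job_id
"""
```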

Fallback Polling Job

  • Schedule: Runs hourly
  • Purpose: Catch any webhook events that were missed
  • Function: Queries Apify for completed runs and processes them

Environments

The system is deployed in three separate environments, each with isolated resources:

| Environment | API Service                         | Polling Job                            | Tracker Table                        | Purpose                   |
|-------------|-------------------------------------|----------------------------------------|--------------------------------------|---------------------------|
| Dev         | brand-interactions-pipeline-dev     | brand-interactions-polling-job-dev     | ApifyInteractionsTrackerTableDev     | Development & testing     |
| Staging     | brand-interactions-pipeline-staging | brand-interactions-polling-job-staging | ApifyInteractionsTrackerTableStaging | Pre-production validation |
| Production  | brand-interactions-pipeline         | brand-interactions-polling-job         | ApifyInteractionsTrackerTable        | Live production use       |
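
The per-environment resource names above can be expressed as a lookup that code selects from at startup; the dictionary keys are assumptions about how the configuration is keyed, but the resource names come from the table:

```python
# Resource names per environment, taken from the table above.
ENVIRONMENTS = {
    "dev": {
        "api_service": "brand-interactions-pipeline-dev",
        "polling_job": "brand-interactions-polling-job-dev",
        "tracker_table": "ApifyInteractionsTrackerTableDev",
    },
    "staging": {
        "api_service": "brand-interactions-pipeline-staging",
        "polling_job": "brand-interactions-polling-job-staging",
        "tracker_table": "ApifyInteractionsTrackerTableStaging",
    },
    "production": {
        "api_service": "brand-interactions-pipeline",
        "polling_job": "brand-interactions-polling-job",
        "tracker_table": "ApifyInteractionsTrackerTable",
    },
}

def tracker_table(env: str) -> str:
    """Resolve the tracker table for an environment."""
    return ENVIRONMENTS[env]["tracker_table"]
```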

Data Flow Details

Request Processing

  1. Client sends POST request with LinkedIn targets
  2. API validates request parameters
  3. For each target, API creates an Apify job with:
      • LinkedIn handle
      • Scraping parameters (limits, filters)
      • Webhook URL for completion notification
  4. Tracker table entry created with PENDING status
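
The tracker entry created in the final step might look like this; the field names are assumptions about the tracker-table schema:

```python
from datetime import datetime, timezone

def new_tracker_entry(handle: str, apify_run_id: str) -> dict:
    """Build the row inserted into the tracker table when a job starts.

    Field names are illustrative assumptions, not the actual schema.
    """
    return {
        "handle": handle,
        "apify_run_id": apify_run_id,
        "status": "PENDING",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```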

Scraping Phase

  1. Apify actor starts scraping LinkedIn
  2. Collects posts, reactions, and comments based on parameters
  3. Typical duration: 5-20 minutes depending on data volume

Completion Phase

  1. Apify sends webhook to API service
  2. Webhook handler:
      • Retrieves scraped data from Apify
      • Writes posts to ApifyLinkedinPost table
      • Writes comments to ApifyLinkedinPostComment table
      • Writes reactions to ApifyLinkedinPostReaction table
      • Updates tracker table with COMPLETED status and counts
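
The fan-out of scraped items into the three data tables can be sketched as a routing step. The `type` discriminator field is an assumption about the shape of the Apify dataset items:

```python
# Destination table per item type; the "type" field on scraped items
# is an assumption about the Apify dataset format.
TABLE_BY_TYPE = {
    "post": "ApifyLinkedinPost",
    "comment": "ApifyLinkedinPostComment",
    "reaction": "ApifyLinkedinPostReaction",
}

def route_items(items: list[dict]) -> dict[str, list[dict]]:
    """Group scraped items by the BigQuery table they belong in."""
    grouped: dict[str, list[dict]] = {t: [] for t in TABLE_BY_TYPE.values()}
    for item in items:
        table = TABLE_BY_TYPE.get(item.get("type", ""))
        if table is not None:  # silently skip unrecognized item types
            grouped[table].append(item)
    return grouped
```

Each group would then be written to its table in a single batched insert before the tracker row is marked COMPLETED with the per-table counts.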

Fallback Mechanism

  1. Polling job runs every hour
  2. Queries tracker table for PENDING/RUNNING jobs
  3. Checks Apify for completion status
  4. Processes any completed jobs that weren't caught by webhooks
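
The selection logic in steps 2-4 can be sketched as a pure function; `apify_status` stands in for a lookup against Apify's run-status API, and the row field names are assumptions about the tracker schema:

```python
def jobs_to_process(tracker_rows: list[dict],
                    apify_status: dict[str, str]) -> list[dict]:
    """Return PENDING/RUNNING tracker rows whose Apify run already succeeded.

    `apify_status` maps Apify run IDs to their current status and stands
    in for a real call to Apify's run-status API.
    """
    return [
        row
        for row in tracker_rows
        if row["status"] in ("PENDING", "RUNNING")
        and apify_status.get(row["apify_run_id"]) == "SUCCEEDED"
    ]
```

Rows returned here are handed to the same processing path the webhook handler uses, which keeps the fallback idempotent with the primary flow.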