# Architecture

## Overview
The LinkedIn Interactions Pipeline is a REST API service that orchestrates the scraping of LinkedIn interactions using Apify and stores the results in BigQuery.
## System Flow

1. User POSTs to `/apify/linkedin/interactions` with LinkedIn targets
2. API triggers Apify scraping jobs (one per target) with a webhook URL
3. Apify scrapes LinkedIn data (5-20 minutes)
4. Apify sends a completion event to the webhook endpoint
5. Webhook handler processes the data and saves it to BigQuery tables
6. Data is available in BigQuery: `ApifyLinkedinPost`, `ApifyLinkedinPostComment`, `ApifyLinkedinPostReaction`
Note: A separate polling job runs hourly as a fallback to catch any missed webhook events.
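For orientation, here is a minimal sketch of triggering the pipeline from a client. The payload shape, field names, and service URL are assumptions for illustration, not a documented schema:

```python
import requests

# Hypothetical payload shape; field names are illustrative assumptions.
payload = {
    "targets": ["example-company", "another-company"],  # LinkedIn handles
    "post_limit": 100,  # assumed scraping parameter
}

# Base URL is a placeholder for the deployed Cloud Run service.
resp = requests.post(
    "https://<service-url>/apify/linkedin/interactions",
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. run identifiers for later status checks
```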
## Components

### API Service

- Technology: FastAPI (Python)
- Deployment: Google Cloud Run
- Responsibilities:
  - Validate incoming requests
  - Trigger Apify scraping jobs
  - Handle webhook callbacks from Apify
  - Update tracker tables
  - Write scraped data to BigQuery
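A rough sketch of how the trigger endpoint could be wired up in FastAPI. The request model and the `trigger_apify_job` helper are hypothetical stand-ins, not the actual implementation:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InteractionsRequest(BaseModel):
    # Hypothetical schema; the real field names are not documented here.
    targets: list[str]
    post_limit: int = 100

def trigger_apify_job(handle: str, post_limit: int) -> str:
    """Placeholder: start an Apify actor run for one target (see the
    Apify Integration section) and return its run ID."""
    raise NotImplementedError

@app.post("/apify/linkedin/interactions")
def start_interactions(req: InteractionsRequest) -> dict:
    # One Apify job per target, as described in the flow above.
    run_ids = [trigger_apify_job(h, req.post_limit) for h in req.targets]
    # A PENDING tracker row per run would also be written here.
    return {"run_ids": run_ids}
```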
### Apify Integration
- Service: Apify Actor for LinkedIn scraping
- Input: LinkedIn handles and scraping parameters
- Output: Posts, comments, and reactions data
- Callback: Webhook to API service on completion
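A sketch of starting a job with the Apify Python client, including the completion webhook. The actor ID and `run_input` fields are placeholders; the real values depend on the actor's own input schema:

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")  # token from environment/secret manager

# Actor ID and run_input fields are placeholders for the real actor's schema.
run = client.actor("<username>/<linkedin-scraper-actor>").start(
    run_input={
        "handle": "example-company",
        "maxPosts": 100,
    },
    # Ask Apify to call our webhook endpoint when the run finishes.
    webhooks=[
        {
            "event_types": ["ACTOR.RUN.SUCCEEDED"],
            "request_url": "https://<service-url>/apify/webhook",
        }
    ],
)
print(run["id"], run["status"])  # completion arrives later via the webhook
```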
### Data Storage

- Platform: Google BigQuery
- Tables:
  - Tracker Tables: Track job status and progress
  - Data Tables: Store scraped LinkedIn content
- Access: SQL queries for status checks and data retrieval
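For example, a status check against the tracker table might look like this sketch using the `google-cloud-bigquery` client. The project, dataset, and column names are assumptions to adjust to the actual schema:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses default GCP credentials

# Dataset and column names are illustrative; match them to the real schema.
query = """
    SELECT run_id, status, updated_at
    FROM `<project>.<dataset>.ApifyInteractionsTrackerTable`
    WHERE status IN ('PENDING', 'RUNNING')
    ORDER BY updated_at DESC
"""
for row in client.query(query).result():
    print(row.run_id, row.status, row.updated_at)
```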
### Fallback Polling Job
- Schedule: Runs hourly
- Purpose: Catch any webhook events that were missed
- Function: Queries Apify for completed runs and processes them
## Environments
The system is deployed in three separate environments, each with isolated resources:
| Environment | API Service | Polling Job | Tracker Table | Purpose |
|---|---|---|---|---|
| Dev | `brand-interactions-pipeline-dev` | `brand-interactions-polling-job-dev` | `ApifyInteractionsTrackerTableDev` | Development & testing |
| Staging | `brand-interactions-pipeline-staging` | `brand-interactions-polling-job-staging` | `ApifyInteractionsTrackerTableStaging` | Pre-production validation |
| Production | `brand-interactions-pipeline` | `brand-interactions-polling-job` | `ApifyInteractionsTrackerTable` | Live production use |
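One way the per-environment resource names could be derived in code; a minimal sketch, assuming an `ENVIRONMENT` variable and the suffix scheme visible in the table above (the variable name itself is an assumption):

```python
import os

# Assumed convention: ENVIRONMENT is one of "dev", "staging", "prod".
ENV = os.environ.get("ENVIRONMENT", "dev")

TABLE_SUFFIX = {"dev": "Dev", "staging": "Staging", "prod": ""}[ENV]
TRACKER_TABLE = f"ApifyInteractionsTrackerTable{TABLE_SUFFIX}"

SERVICE_SUFFIX = {"dev": "-dev", "staging": "-staging", "prod": ""}[ENV]
API_SERVICE = f"brand-interactions-pipeline{SERVICE_SUFFIX}"
```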
## Data Flow Details

### Request Processing

- Client sends a POST request with LinkedIn targets
- API validates the request parameters
- For each target, the API creates an Apify job with:
  - LinkedIn handle
  - Scraping parameters (limits, filters)
  - Webhook URL for completion notification
- Tracker table entry created with PENDING status (sketched below)
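A minimal sketch of writing the PENDING tracker row with the BigQuery client; the table path and column names are assumptions:

```python
from datetime import datetime, timezone

from google.cloud import bigquery

client = bigquery.Client()

# Column names are illustrative; align them with the actual tracker schema.
rows = [{
    "run_id": "<apify-run-id>",
    "linkedin_handle": "example-company",
    "status": "PENDING",
    "created_at": datetime.now(timezone.utc).isoformat(),
}]
errors = client.insert_rows_json(
    "<project>.<dataset>.ApifyInteractionsTrackerTable", rows
)
if errors:
    raise RuntimeError(f"Tracker insert failed: {errors}")
```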
### Scraping Phase
- Apify actor starts scraping LinkedIn
- Collects posts, reactions, and comments based on parameters
- Typical duration: 5-20 minutes depending on data volume
### Completion Phase

- Apify sends a webhook to the API service
- Webhook handler (sketched below):
  - Retrieves the scraped data from Apify
  - Writes posts to the `ApifyLinkedinPost` table
  - Writes comments to the `ApifyLinkedinPostComment` table
  - Writes reactions to the `ApifyLinkedinPostReaction` table
  - Updates the tracker table with COMPLETED status and record counts
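Putting the completion phase together, the webhook handler might look like this sketch. The webhook path, the dataset-item shape (a `type` discriminator), and the table paths are assumptions; the payload of Apify's run-succeeded webhook should be checked against Apify's documentation:

```python
from apify_client import ApifyClient
from fastapi import FastAPI, Request
from google.cloud import bigquery

app = FastAPI()
apify = ApifyClient("<APIFY_TOKEN>")
bq = bigquery.Client()

@app.post("/apify/webhook")  # path is a placeholder
async def apify_webhook(request: Request) -> dict:
    payload = await request.json()
    # Apify's default webhook payload includes the run object under
    # "resource"; verify the exact shape against Apify's docs.
    run = payload["resource"]
    dataset_id = run["defaultDatasetId"]

    posts, comments, reactions = [], [], []
    buckets = {"post": posts, "comment": comments, "reaction": reactions}
    for item in apify.dataset(dataset_id).iterate_items():
        # Assumed item shape: a "type" discriminator on each record.
        bucket = buckets.get(item.get("type"))
        if bucket is not None:
            bucket.append(item)

    # Table paths are placeholders; see the Data Storage section.
    for table, rows in [
        ("ApifyLinkedinPost", posts),
        ("ApifyLinkedinPostComment", comments),
        ("ApifyLinkedinPostReaction", reactions),
    ]:
        if rows:
            bq.insert_rows_json(f"<project>.<dataset>.{table}", rows)

    # Marking the tracker row COMPLETED is omitted here for brevity.
    return {"ok": True}
```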
### Fallback Mechanism
- Polling job runs every hour
- Queries tracker table for PENDING/RUNNING jobs
- Checks Apify for completion status
- Processes any completed jobs that weren't caught by webhooks
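A sketch of the hourly polling pass, reusing the assumed tracker schema and the Apify client from earlier sections; `process_completed_run` is a placeholder for the same dataset-to-BigQuery logic as the webhook handler:

```python
from apify_client import ApifyClient
from google.cloud import bigquery

apify = ApifyClient("<APIFY_TOKEN>")
bq = bigquery.Client()

def process_completed_run(run_id: str) -> None:
    """Placeholder: same dataset-to-BigQuery logic as the webhook handler."""
    raise NotImplementedError

def poll_pending_jobs() -> None:
    # Find jobs the webhooks may have missed (column names are assumptions).
    query = """
        SELECT run_id
        FROM `<project>.<dataset>.ApifyInteractionsTrackerTable`
        WHERE status IN ('PENDING', 'RUNNING')
    """
    for row in bq.query(query).result():
        run = apify.run(row.run_id).get()
        if run and run["status"] == "SUCCEEDED":
            process_completed_run(row.run_id)

if __name__ == "__main__":
    poll_pending_jobs()  # invoked hourly, e.g. by Cloud Scheduler
```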