Initial Load Mode

Starting with release 2.2.0, the consumer can operate in a special mode called initial_load. This mode is optimized for high-volume data ingestion.

Consumer Modes

The consumer supports two modes:

Mode           Description
streaming      Default mode; standard behavior consistent with previous versions
initial_load   Optimized for high-volume data ingestion

When to Use Initial Load Mode

The initial_load mode is recommended for:

  • Ingesting large datasets from Solifi upstream systems (typically more than 5 million messages)
  • Accelerating initial data ingestion
  • Bootstrapping a new database

Important

The consumer cannot run permanently in initial_load mode. Once data has been loaded, you must switch back to streaming mode.

How Initial Load Mode Works

Streaming Mode (Default)

In streaming mode, the consumer follows an update-if-present principle:

  • Inserts a new record if it doesn't exist
  • Updates it if it does exist

While this ensures data consistency, it can significantly slow down message processing with very large datasets.
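
Conceptually, each message becomes an upsert. The following is a minimal T-SQL sketch of the update-if-present principle, using a hypothetical names table keyed by id (the consumer's actual SQL is internal and may differ):

-- Hypothetical table, matching the save-full-audit example later on this page
CREATE TABLE names (id INT PRIMARY KEY, name VARCHAR(50));

DECLARE @id INT = 2, @name VARCHAR(50) = 'Rex';

-- Update-if-present: update the row when the key exists, insert otherwise
MERGE INTO names AS target
USING (SELECT @id AS id, @name AS name) AS source
    ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET name = source.name
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (source.id, source.name);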

Initial Load Mode

In initial_load mode, the consumer operates on an insert-only principle:

  • Processes one topic partition at a time
  • Bypasses update locks
  • Significantly faster ingestion
  • Consumer stops automatically when complete
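
By contrast, a sketch of the insert-only principle, against the same hypothetical names table as above (rows are appended without existence checks, so no update locks are taken):

-- Insert-only: append the row; no key lookup, no update lock
INSERT INTO names (id, name) VALUES (2, 'Rex');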

Performance Comparison

Scenario                    Streaming Mode                Initial Load Mode
140 topics, 100M messages   10-12 days (single consumer)  ~4 hours (8 consumers)
Database locking            Update locks                  No locking
Memory usage                Moderate                      Higher during load

Running Initial Load Mode

The process involves four steps:

  1. Perform a Dry Run
  2. Execute the Initial Load
  3. Monitor Load Progress
  4. Switch Back to Streaming Mode

Step 1: Perform a Dry Run

Before performing the actual data load, start with a clean backend database.

The dry run phase identifies all topics and partitions and records offset information without consuming any messages.

solifi:
  initial-load:
    enabled: true           # Enables initial_load mode
    dryrun: true            # Performs discovery only
    batch-size: 10000       # Optional. Messages per batch (default: 100000)
    save-full-audit: false  # Optional. Save all audit records (default: false)
    clientId: load-app-1    # Unique identifier for this consumer instance

The process completes within a few seconds and creates a database table named lp_initial_load.
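
To confirm the discovery results before loading, you can inspect the recorded offsets, for example:

SELECT topic_name, partition_id, start_offset, end_offset, status
FROM lp_initial_load
ORDER BY topic_name, partition_id;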

lp_initial_load Table Structure

Column            Description
topic_name        Name of the Kafka topic
partition_id      Partition number of the topic
start_offset      Offset from which the consumer starts reading
end_offset        Offset up to (but not including) which the consumer reads
status            Current processing status
total_entries     Total number of unique messages identified
load_duration_ms  Time taken to process the partition

Status Lifecycle

State      Description                     Next States
INITIAL    Default state                   ALLOCATED
ALLOCATED  Reserved for processing         LOADED, FAILED
LOADED     Data read up to end_offset - 1  SAVED, FAILED
SAVED      Records written to database     COMPLETED, FAILED
COMPLETED  Partition fully processed       — (terminal)
FAILED     Processing error occurred       — (manual reset required)

Step 2: Execute the Initial Load

After the dry run, modify your configuration:

solifi:
  initial-load:
    enabled: true
    dryrun: false  # Enable actual loading
    batch-size: 20000
    clientId: load-app-1

Each clientId instance processes one topic at a time and handles all of its partitions sequentially.

Running Multiple Instances

To speed up processing, run multiple consumer instances in parallel:

  1. Assign different clientId values to each instance
  2. Use the same Kafka consumer group ID
spring:
  kafka:
    consumer:
      group-id: mycompany-initial-load  # Same for all instances

Parallel Processing

When multiple instances start with unique clientId values, each acquires a lock on one topic. After processing all partitions, it looks for another topic with status = INITIAL. If none remain, the instance shuts down.
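
For example, a second instance would use an identical configuration, changing only clientId (values shown are illustrative):

# Second instance: same consumer group, unique clientId
solifi:
  initial-load:
    enabled: true
    dryrun: false
    batch-size: 20000
    clientId: load-app-2  # Must differ from load-app-1

spring:
  kafka:
    consumer:
      group-id: mycompany-initial-load  # Same for all instances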

Step 3: Monitor Load Progress

Monitor progress using SQL queries against the lp_initial_load table.

View Current Progress

SELECT * 
FROM lp_initial_load
WHERE status NOT IN ('INITIAL')
ORDER BY last_updated_ts DESC;

View Load Duration (Melbourne Time)

SELECT
    SWITCHOFFSET(MIN(load_started_ts), '+10:00') AS Start,
    SWITCHOFFSET(MAX(last_updated_ts), '+10:00') AS Last,
    DATEDIFF(MINUTE, MIN(load_started_ts), MAX(last_updated_ts)) AS Minutes,
    CONVERT(VARCHAR(5), DATEADD(MINUTE, DATEDIFF(MINUTE, MIN(load_started_ts), MAX(last_updated_ts)), 0), 114) AS Duration
FROM lp_initial_load;

View Progress by Consumer Instance

SELECT status, status_info, COUNT(*) AS Count
FROM lp_initial_load
GROUP BY status, status_info
ORDER BY status, status_info;

Step 4: Switch Back to Streaming Mode

Once all partitions are processed, verify that every record has COMPLETED status.
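
For example, the following query should return no rows once the load has finished:

SELECT topic_name, partition_id, status
FROM lp_initial_load
WHERE status <> 'COMPLETED';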

Remove or disable the initial-load section:

solifi:
  # initial-load:
  #   enabled: false
  #   dryrun: false
  #   batch-size: 20000
  #   clientId: load-app-1

The consumer will resume from the end_offset values recorded in the lp_initial_load table.

Backlog Processing

If upstream systems continue producing messages during the initial load, a small backlog may accumulate. The consumer automatically resumes from the latest offsets upon switching back to streaming mode.

The save-full-audit Property

When save-full-audit is set to true, the consumer stores all records for a given key in the audit table. This differs from the default behavior.

Default Behavior (save-full-audit: false)

Only the latest record for each key is saved in the audit table.

With save-full-audit: true

Every record associated with a key is persisted.

Example

Given a topic named names with id as the key:

id  name
1   Sam
2   Fred
3   Brett
2   Rom
2   Rex

Default behavior (save-full-audit: false):

names_audit table:

id  name   lp_db_action
1   Sam    INITIAL
2   Rex    INITIAL
3   Brett  INITIAL

With save-full-audit: true:

names_audit table:

id  name   lp_db_action
1   Sam    INITIAL
2   Fred   INITIAL
3   Brett  INITIAL
2   Rom    INITIAL
2   Rex    INITIAL

Performance Impact

Enabling save-full-audit increases overall execution time.

Recovering from Failed Loads

Warning

Do not attempt recovery while the consumer is running in streaming mode. This may result in inconsistent data.

Failures typically occur due to:

  • Insufficient resources (especially memory)
  • Database downtime
  • Network interruptions

There are two recovery options.

Option 1: Full reload

  1. Start with a fresh database (no existing data or audit tables)
  2. Repeat from Step 1: Perform a Dry Run

Option 2: Partial recovery

  1. Identify failed topics and partitions using lp_initial_load (see the query below)
  2. Drop their corresponding data and audit tables
  3. Reset their statuses to INITIAL
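
The partitions needing attention can be listed with a query along these lines (any status other than INITIAL or COMPLETED means the partition did not finish cleanly):

SELECT topic_name, partition_id, status
FROM lp_initial_load
WHERE status NOT IN ('INITIAL', 'COMPLETED');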

Example with two affected topics (topic_a failed outright; topic_b was interrupted mid-load):

topic_name  partition_id  status
topic_a     0             FAILED
topic_a     1             FAILED
topic_b     0             ALLOCATED
topic_b     1             SAVED

Drop the tables and reset:

-- Drop tables first, then:
UPDATE lp_initial_load
SET status = 'INITIAL'
WHERE topic_name IN ('topic_a', 'topic_b');

Restart the consumer in initial_load mode.

Infrastructure Recommendations

Performance depends on:

  • Volume of data per topic and partition
  • Number of parallel consumer instances
  • Database capacity (CPU, memory, storage)

Sample Workload Results

Parameter                     Value
Total topics                  144
Average partitions per topic  6
Largest partition             1 million messages
Total messages                46 million

Workload               Duration
Total messages (46M)   2 hours
With Audit (92M)       3 hours 47 mins
With Full Audit (95M)  4 hours 17 mins

Configuration Used

  • Consumer instances: 8 (each with 4 GB memory and 4 CPU)
  • Database instance: 8 CPU, 16 GB memory
  • Network: Local, no firewalls or packet inspection

Post-Load Resources

After switching to streaming mode, resource requirements drop substantially:

  • Typical configuration: 2 CPU / 2 GB memory per consumer instance
