BasicFunctionalityIntegrationTest Implementation Guide

Summary: Comprehensive guide for implementing the full CDK integration test suite. This test validates edge cases, type handling, schema evolution, and CDC support. Required for production certification.

When to use this: After Phase 8 (working connector with ConnectorWiringSuite passing)

Time estimate: 5-9 hours for complete implementation (see the breakdown at the end of this guide)


What BasicFunctionalityIntegrationTest Validates

Comprehensive test coverage (50+ scenarios):

Data Type Handling

  • All Airbyte types (string, integer, number, boolean, date, time, timestamp)
  • Nested objects and arrays
  • Union types (multiple possible types for one field)
  • Unknown types (unrecognized JSON schema types)
  • Null values vs unset fields
  • Large integers/decimals (precision handling)

Sync Modes

  • testAppend() - Incremental append without deduplication
  • testDedupe() - Incremental append with primary key deduplication
  • testTruncate() - Full refresh (replace all data)
  • testAppendSchemaEvolution() - Schema changes during append

Schema Evolution

  • Add column
  • Drop column
  • Change column type (widening)
  • Nullable to non-nullable changes

CDC Support (if enabled)

  • Hard delete (actually remove records)
  • Soft delete (tombstone records)
  • Delete non-existent records
  • Insert + delete in same sync

Edge Cases

  • Empty syncs
  • Very large datasets
  • Concurrent streams
  • State checkpointing
  • Error recovery

Prerequisites

Before starting, you must have:

  • Phase 8 complete (ConnectorWiringSuite passing)
  • Phase 13 complete (if testing dedupe mode)
  • Working database connection (Testcontainers or real DB)
  • All sync modes implemented

Testing Phase 1: BasicFunctionalityIntegrationTest

Testing Step 1: Implement Test Helper Classes

Step 1.1: Create DestinationDataDumper

Purpose: Read data from database for test verification

File: src/test-integration/kotlin/.../{DB}DataDumper.kt

package io.airbyte.integrations.destination.{db}

import io.airbyte.cdk.load.command.DestinationStream
import io.airbyte.cdk.load.data.*
import io.airbyte.cdk.load.test.util.OutputRecord
import io.airbyte.cdk.load.test.util.destination.DestinationDataDumper
import javax.sql.DataSource

class {DB}DataDumper(
    private val dataSource: DataSource,
) : DestinationDataDumper {

    override fun dumpRecords(stream: DestinationStream): List<OutputRecord> {
        val tableName = stream.descriptor.name  // Or use name generator
        val namespace = stream.descriptor.namespace ?: "test"

        val records = mutableListOf<OutputRecord>()

        dataSource.connection.use { connection ->
            val sql = "SELECT * FROM \"$namespace\".\"$tableName\""
            connection.createStatement().use { statement ->
                val rs = statement.executeQuery(sql)
                val metadata = rs.metaData

                while (rs.next()) {
                    val data = mutableMapOf<String, AirbyteValue>()

                    for (i in 1..metadata.columnCount) {
                        val columnName = metadata.getColumnName(i)
                        val value = rs.getObject(i)

                        // Convert database value to AirbyteValue
                        data[columnName] = when {
                            value == null -> NullValue
                            value is String -> StringValue(value)
                            value is Int -> IntegerValue(value.toLong())
                            value is Long -> IntegerValue(value)
                            value is Boolean -> BooleanValue(value)
                            value is java.math.BigDecimal -> NumberValue(value)
                            value is java.sql.Timestamp -> TimestampWithTimezoneValue(value.toInstant().toString())
                            value is java.sql.Date -> DateValue(value.toLocalDate().toString())
                            // Add more type conversions as needed
                            else -> StringValue(value.toString())
                        }
                    }

                    // Extract Airbyte metadata columns
                    // _airbyte_extracted_at was converted to an ISO instant string
                    // above, so parse it back to epoch millis
                    val extractedAt = (data["_airbyte_extracted_at"] as? TimestampWithTimezoneValue)
                        ?.let { java.time.Instant.parse(it.value.toString()).toEpochMilli() }
                        ?: 0L
                    val generationId = (data["_airbyte_generation_id"] as? IntegerValue)?.value?.toLong() ?: 0L
                    val meta = data["_airbyte_meta"]  // ObjectValue with errors/changes

                    records.add(
                        OutputRecord(
                            extractedAt = extractedAt,
                            generationId = generationId,
                            data = data.filterKeys { !it.startsWith("_airbyte") },
                            airbyteMeta = parseAirbyteMeta(meta)
                        )
                    )
                }
            }
        }

        return records
    }

    private fun parseAirbyteMeta(meta: AirbyteValue?): OutputRecord.Meta {
        // Parse _airbyte_meta JSON to OutputRecord.Meta
        // Placeholder for now; a fuller sketch follows after this section
        return OutputRecord.Meta(syncId = 0)
    }
}

What this does:

  • Queries database table for a stream
  • Converts database types back to AirbyteValue
  • Extracts Airbyte metadata columns
  • Returns OutputRecord list for test assertions
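
The parseAirbyteMeta placeholder above compiles but discards the metadata. Below is a slightly fuller sketch, intended as a drop-in replacement for the method in {DB}DataDumper. It assumes the dumper returns _airbyte_meta as a StringValue containing JSON with a top-level sync_id field; OutputRecord.Meta's exact shape varies by CDK version, so check the CDK source for the fields yours expects.

import com.fasterxml.jackson.databind.ObjectMapper

private val objectMapper = ObjectMapper()

private fun parseAirbyteMeta(meta: AirbyteValue?): OutputRecord.Meta {
    // The column may come back as a JSON string or a native object,
    // depending on how your database and dumper convert it
    val node = when (meta) {
        is StringValue -> objectMapper.readTree(meta.value)
        else -> return OutputRecord.Meta(syncId = 0)  // fall back to the placeholder
    }
    // "sync_id" is an assumed key; verify it against your CDK version
    return OutputRecord.Meta(syncId = node.path("sync_id").asLong(0))
}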

Step 1.2: Create DestinationCleaner

Purpose: Clean up test data between test runs

File: src/test-integration/kotlin/.../{DB}Cleaner.kt

package io.airbyte.integrations.destination.{db}

import io.airbyte.cdk.load.test.util.destination.DestinationCleaner
import javax.sql.DataSource

class {DB}Cleaner(
    private val dataSource: DataSource,
    private val testNamespace: String = "test",
) : DestinationCleaner {

    override fun cleanup() {
        dataSource.connection.use { connection ->
            // Drop all test tables
            val sql = """
                SELECT table_name
                FROM information_schema.tables
                WHERE table_schema = '$testNamespace'
            """

            connection.createStatement().use { statement ->
                val rs = statement.executeQuery(sql)
                val tablesToDrop = mutableListOf<String>()

                while (rs.next()) {
                    tablesToDrop.add(rs.getString("table_name"))
                }

                // Drop each table
                tablesToDrop.forEach { tableName ->
                    try {
                        statement.execute("DROP TABLE IF EXISTS \"$testNamespace\".\"$tableName\" CASCADE")
                    } catch (e: Exception) {
                        // Ignore errors during cleanup
                    }
                }
            }

            // Optionally drop test namespace
            try {
                connection.createStatement().use {
                    it.execute("DROP SCHEMA IF EXISTS \"$testNamespace\" CASCADE")
                }
            } catch (e: Exception) {
                // Ignore
            }
        }
    }
}

What this does:

  • Finds all tables in test namespace
  • Drops them to clean up between tests
  • Runs once per test suite (not per test)
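
Note: the information_schema query above covers Postgres- and MySQL-style catalogs; databases that expose their catalog differently (for example via SHOW TABLES) need an equivalent query.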

Testing Step 2: Create BasicFunctionalityIntegrationTest Class

Step 2.1: Understand Required Parameters

BasicFunctionalityIntegrationTest has 15 required constructor parameters, plus useDataFlowPipeline for the dataflow CDK (16 total, listed below):

  • configContents (String): Database config JSON. Common value: load from secrets/config.json
  • configSpecClass (Class): Specification class. Common value: {DB}Specification::class.java
  • dataDumper (DestinationDataDumper): Read data for verification. Common value: {DB}DataDumper(dataSource)
  • destinationCleaner (DestinationCleaner): Clean between tests. Common value: {DB}Cleaner(dataSource)
  • isStreamSchemaRetroactive (Boolean): Schema changes apply retroactively. Common value: true (usually)
  • dedupBehavior (DedupBehavior?): CDC deletion mode. Common value: DedupBehavior(CdcDeletionMode.HARD_DELETE)
  • stringifySchemalessObjects (Boolean): Convert objects without schema to strings. Common value: false
  • schematizedObjectBehavior (SchematizedNestedValueBehavior): How to handle nested objects. Common value: PASS_THROUGH or STRINGIFY
  • schematizedArrayBehavior (SchematizedNestedValueBehavior): How to handle nested arrays. Common value: STRINGIFY (usually)
  • unionBehavior (UnionBehavior): How to handle union types. Common value: STRINGIFY or PROMOTE_TO_OBJECT
  • supportFileTransfer (Boolean): Supports file uploads. Common value: false (for databases)
  • commitDataIncrementally (Boolean): Commit during sync vs at end. Common value: true
  • allTypesBehavior (AllTypesBehavior): Type handling configuration. Common value: StronglyTyped(...)
  • unknownTypesBehavior (UnknownTypesBehavior): Unknown type handling. Common value: PASS_THROUGH
  • nullEqualsUnset (Boolean): Null same as missing field. Common value: true
  • useDataFlowPipeline (Boolean): Use dataflow CDK architecture. Common value: true (REQUIRED for dataflow CDK)

Step 2.2: Create Test Class

File: src/test-integration/kotlin/.../{DB}BasicFunctionalityTest.kt

package io.airbyte.integrations.destination.{db}

import com.zaxxer.hikari.HikariDataSource
import io.airbyte.cdk.load.test.util.destination.DestinationCleaner
import io.airbyte.cdk.load.test.util.destination.DestinationDataDumper
import io.airbyte.cdk.load.write.AllTypesBehavior
import io.airbyte.cdk.load.write.BasicFunctionalityIntegrationTest
import io.airbyte.cdk.load.write.DedupBehavior
import io.airbyte.cdk.load.write.SchematizedNestedValueBehavior
import io.airbyte.cdk.load.write.UnionBehavior
import io.airbyte.cdk.load.write.UnknownTypesBehavior
import io.airbyte.integrations.destination.{db}.spec.{DB}Specification
import java.nio.file.Path
import javax.sql.DataSource
import org.junit.jupiter.api.BeforeAll
import org.junit.jupiter.api.Test

class {DB}BasicFunctionalityTest : BasicFunctionalityIntegrationTest(
    configContents = Path.of("secrets/config.json").toFile().readText(),
    configSpecClass = {DB}Specification::class.java,
    dataDumper = createDataDumper(),
    destinationCleaner = createCleaner(),

    // Schema behavior
    isStreamSchemaRetroactive = true,

    // CDC deletion mode
    dedupBehavior = DedupBehavior(DedupBehavior.CdcDeletionMode.HARD_DELETE),

    // Type handling
    stringifySchemalessObjects = false,
    schematizedObjectBehavior = SchematizedNestedValueBehavior.PASS_THROUGH,
    schematizedArrayBehavior = SchematizedNestedValueBehavior.STRINGIFY,
    unionBehavior = UnionBehavior.STRINGIFY,

    // Feature support
    supportFileTransfer = false,  // Database destinations don't transfer files
    commitDataIncrementally = true,

    // Type system behavior
    allTypesBehavior = AllTypesBehavior.StronglyTyped(
        integerCanBeLarge = false,  // true if your DB has unlimited integers
        numberCanBeLarge = false,   // true if your DB has unlimited precision
        nestedFloatLosesPrecision = false,
    ),
    unknownTypesBehavior = UnknownTypesBehavior.PASS_THROUGH,
    nullEqualsUnset = true,

    // Dataflow CDK architecture (REQUIRED for new CDK)
    useDataFlowPipeline = true,  // ⚠️ Must be true for dataflow CDK connectors
) {
    companion object {
        private lateinit var testDataSource: DataSource

        @JvmStatic
        @BeforeAll
        fun beforeAll() {
            // Set up test database (Testcontainers or real DB)
            testDataSource = createTestDataSource()
        }

        private fun createDataDumper(): DestinationDataDumper {
            return {DB}DataDumper(testDataSource)
        }

        private fun createCleaner(): DestinationCleaner {
            return {DB}Cleaner(testDataSource)
        }

        private fun createTestDataSource(): DataSource {
            // Initialize Testcontainers or connection pool
            val container = {DB}Container("{db}:latest")
            container.start()

            return HikariDataSource().apply {
                jdbcUrl = container.jdbcUrl
                username = container.username
                password = container.password
            }
        }
    }

    // Test methods - override and enable these as you implement each feature

    @Test
    override fun testAppend() {
        super.testAppend()
    }

    @Test
    override fun testTruncate() {
        super.testTruncate()
    }

    @Test
    override fun testAppendSchemaEvolution() {
        super.testAppendSchemaEvolution()
    }

    @Test
    override fun testDedupe() {
        super.testDedupe()
    }
}

Testing Step 3: Configure Test Parameters

Quick Reference

  • configContents: DB connection config. Typical: Path.of("secrets/config.json").toFile().readText()
  • configSpecClass: Your spec class. Typical: {DB}Specification::class.java
  • dataDumper: Read test data (from Step 1). Typical: {DB}DataDumper(testDataSource)
  • destinationCleaner: Cleanup test data (from Step 1). Typical: {DB}Cleaner(testDataSource)
  • isStreamSchemaRetroactive: Schema changes apply to existing data. Typical: true
  • supportFileTransfer: Database destinations don't support files. Typical: false
  • commitDataIncrementally: Commit batches as written. Typical: true
  • nullEqualsUnset: Treat {"x": null} same as {}. Typical: true
  • stringifySchemalessObjects: Use native JSON if available. Typical: false
  • unknownTypesBehavior: Store unrecognized types as-is. Typical: PASS_THROUGH
  • unionBehavior: Convert union types to JSON string. Typical: STRINGIFY
  • schematizedObjectBehavior: See below. Typical: PASS_THROUGH or STRINGIFY
  • schematizedArrayBehavior: See below. Typical: STRINGIFY

Complex Parameters (Database-Specific)

dedupBehavior

Purpose: How to handle CDC deletions

Options:

// Hard delete - remove CDC-deleted records
DedupBehavior(DedupBehavior.CdcDeletionMode.HARD_DELETE)

// Soft delete - keep tombstone records
DedupBehavior(DedupBehavior.CdcDeletionMode.SOFT_DELETE)

// No CDC support yet
null

allTypesBehavior

Purpose: Configure type precision limits

// Snowflake/BigQuery: Unlimited precision
AllTypesBehavior.StronglyTyped(
    integerCanBeLarge = true,
    numberCanBeLarge = true,
    nestedFloatLosesPrecision = false,
)

// MySQL/Postgres: Limited precision
AllTypesBehavior.StronglyTyped(
    integerCanBeLarge = false,  // BIGINT limits
    numberCanBeLarge = false,   // DECIMAL limits
    nestedFloatLosesPrecision = false,
)

schematizedObjectBehavior / schematizedArrayBehavior

Purpose: How to store nested objects and arrays

Options:

  • PASS_THROUGH: Use native JSON/array types (Postgres JSONB, Snowflake VARIANT)
  • STRINGIFY: Convert to JSON strings (fallback for databases without native types)

Recommendations:

  • Objects: PASS_THROUGH if DB has native JSON, else STRINGIFY
  • Arrays: STRINGIFY (most DBs don't have typed arrays, except Postgres)
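
To make the distinction concrete, here is a minimal sketch of what the two behaviors mean at write time. The bindNested helper and its Boolean flag are illustrative, not CDK API, and the sketch assumes Jackson is on the classpath:

import com.fasterxml.jackson.databind.ObjectMapper

private val mapper = ObjectMapper()

// PASS_THROUGH: hand the structure to a native JSON column (JSONB, VARIANT)
// STRINGIFY: serialize to a JSON string and store it in a VARCHAR/TEXT column
fun bindNested(value: Map<String, Any?>, passThrough: Boolean): Any =
    if (passThrough) value else mapper.writeValueAsString(value)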

useDataFlowPipeline ⚠️

Value: true - REQUIRED for dataflow CDK connectors

Why critical: Setting to false uses old CDK code paths that don't work with Aggregate/InsertBuffer pattern. Always use true.


⚠️ CRITICAL: All Tests Must Pass - No Exceptions

NEVER rationalize test failures as:

  • "Cosmetic, not functional"
  • "The connector IS working, tests just need adjustment"
  • "Just test framework expectations vs database behavior"
  • "State message comparison issues, not real problems"
  • "Need database-specific adaptations (but haven't made them)"

Test failures mean ONE of two things:

1. Your Implementation is Wrong (90% of cases)

  • State message format doesn't match expected
  • Schema evolution doesn't work correctly
  • Deduplication logic has bugs
  • Type handling is incorrect

Fix: Debug and fix your implementation

2. Test Expectations Need Tuning (10% of cases)

  • Database truly handles something differently (e.g., ClickHouse soft delete only)
  • Type precision genuinely differs
  • BUT: You must document WHY and get agreement this is acceptable

Fix: Update test parameters with clear rationale

Key principle: If tests fail, the connector is NOT working correctly for production use.

Example rationalizations to REJECT:

"Many tests failing due to state message comparison - cosmetic" → State messages are HOW Airbyte tracks progress. Wrong state = broken checkpointing!

"Schema evolution needs MongoDB-specific expectations" → Implement schema evolution correctly for MongoDB, then tests pass!

"Dedupe tests need configuration" → Add the configuration! Don't skip tests!

"Some tests need adaptations" → Make the adaptations! Document what's different and why!

ALL tests must pass or be explicitly skipped with documented rationale approved by maintainers.

Common Rationalizations That Are WRONG

Agent says: "The 7 failures are specific edge cases - advanced scenarios, not core functionality"

Reality:

  • Truncate/overwrite mode = CORE SYNC MODE used by thousands of syncs
  • Generation ID tracking = REQUIRED for refresh to work correctly
  • "Edge cases" = real user scenarios that WILL happen in production
  • "Advanced scenarios" = standard Airbyte features your connector claims to support

If you don't support a mode:

  • Don't claim to support it (remove from SpecificationExtension)
  • Explicitly skip those tests with @Disabled annotation
  • Document the limitation clearly

If you claim to support it (in SpecificationExtension):

  • Tests MUST pass
  • No "works for normal cases" excuses
  • Users will try to use it and it will break

Agent says: "The connector works for normal use cases"

Reality:

  • Tests define "working"
  • "Normal use cases" is undefined - what's normal?
  • Users will hit "edge cases" in production
  • Failed tests = broken functionality that will cause support tickets

The rule: If supportedSyncModes includes OVERWRITE, then testTruncate() must pass.


Specific Scenarios That Are NOT Optional

Truncate/Overwrite Mode:

  • Used by: Full refresh syncs (very common!)
  • Tests: testTruncate()
  • NOT optional if you declared DestinationSyncMode.OVERWRITE in SpecificationExtension

Generation ID Tracking:

  • Used by: All refresh operations
  • Tests: Generation ID assertions in all tests
  • NOT optional - required for sync modes to work correctly

State Messages:

  • Used by: Checkpointing and resume
  • Tests: State message format validation
  • NOT optional - wrong state = broken incremental syncs

Schema Evolution:

  • Used by: When source schema changes
  • Tests: testAppendSchemaEvolution()
  • NOT optional - users will add/remove columns

Deduplication:

  • Used by: APPEND_DEDUP mode
  • Tests: testDedupe()
  • NOT optional if you declared DestinationSyncMode.APPEND_DEDUP

None of these are "edge cases" - they're core Airbyte features!


Testing Step 4: Run Tests

Test Individually

# Test append mode
$ ./gradlew :destination-{db}:integrationTest --tests "*BasicFunctionalityTest.testAppend"

# Test dedupe mode
$ ./gradlew :destination-{db}:integrationTest --tests "*BasicFunctionalityTest.testDedupe"

# Test schema evolution
$ ./gradlew :destination-{db}:integrationTest --tests "*BasicFunctionalityTest.testAppendSchemaEvolution"

Run Full Suite

$ ./gradlew :destination-{db}:integrationTest --tests "*BasicFunctionalityTest"

Expected: All enabled tests pass

Time: 5-15 minutes (depending on database and data volume)


Testing Step 5: Debug Common Failures

Test: testAppend fails with "Record mismatch"

Cause: DataDumper not converting types correctly

Fix: Check type conversion in DataDumper:

  • Timestamps: Ensure timezone handling matches (see the sketch below)
  • Numbers: Check BigDecimal vs Double conversion
  • Booleans: Check 1/0 vs true/false
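
For timestamps, one common fix is to normalize everything to UTC in the DataDumper before building the AirbyteValue; a minimal sketch, assuming JDBC hands back java.sql.Timestamp:

import java.sql.Timestamp
import java.time.ZoneOffset

// Convert a JDBC timestamp to a timezone-stable ISO-8601 string in UTC,
// so dumped values compare equal regardless of the session timezone
fun toUtcIso(ts: Timestamp): String =
    ts.toInstant().atOffset(ZoneOffset.UTC).toString()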

Test: testDedupe fails with "Expected 1 record, got 2"

Cause: Deduplication not working

Fix: Check upsertTable() implementation:

  • MERGE statement correct? (see the schematic below)
  • Primary key comparison working?
  • Cursor field comparison correct?
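
For reference, a schematic MERGE for dedupe. It assumes an ANSI-style MERGE dialect, a staging table suffixed _staging, a single-column primary key id, a cursor updated_at, and one data column name; all hypothetical, so adapt it to your dialect and the stream's actual columns:

// Schematic only: a real upsertTable() must build this from the stream's
// declared primary key and cursor, not hardcoded column names
fun buildMergeSql(namespace: String, table: String): String = """
    MERGE INTO "$namespace"."$table" AS target
    USING "$namespace"."${table}_staging" AS source
        ON target."id" = source."id"
    WHEN MATCHED AND source."updated_at" >= target."updated_at" THEN
        UPDATE SET "name" = source."name", "updated_at" = source."updated_at"
    WHEN NOT MATCHED THEN
        INSERT ("id", "name", "updated_at")
        VALUES (source."id", source."name", source."updated_at")
""".trimIndent()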

Test: testAppendSchemaEvolution fails with "Column not found"

Cause: Schema evolution (ALTER TABLE) not working

Fix: Check applyChangeset() implementation:

  • ADD COLUMN syntax correct? (sketch below)
  • DROP COLUMN supported?
  • Type changes handled?
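
As a sanity check, a sketch of the statements applyChangeset() typically emits. The added/dropped inputs are hypothetical stand-ins for your CDK version's changeset types:

// added: column name -> dialect type; dropped: column names to remove
fun changesetStatements(
    namespace: String,
    table: String,
    added: Map<String, String>,
    dropped: List<String>,
): List<String> =
    added.map { (col, type) ->
        """ALTER TABLE "$namespace"."$table" ADD COLUMN "$col" $type"""
    } +
        dropped.map { col ->
            """ALTER TABLE "$namespace"."$table" DROP COLUMN "$col""""
        }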

Test: Data type tests fail

Cause: Type mapping issues

Fix: Check ColumnUtils.toDialectType():

  • All Airbyte types mapped?
  • Nullable handling correct?
  • Precision/scale for decimals?
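
A skeletal toDialectType() for comparison. It assumes the load CDK's AirbyteType hierarchy (StringType, IntegerType, and so on) and shows generic ANSI-flavored types; substitute your dialect's:

import io.airbyte.cdk.load.data.*

fun toDialectType(type: AirbyteType): String = when (type) {
    is BooleanType -> "BOOLEAN"
    is IntegerType -> "BIGINT"
    is NumberType -> "DECIMAL(38, 9)"
    is StringType -> "VARCHAR"
    is DateType -> "DATE"
    // objects, arrays, unions, unknowns: stringify as a safe fallback
    else -> "VARCHAR"
}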

Testing Step 6: Optional Test Customization

Skip Tests Not Applicable

// If your DB doesn't support certain features, skip explicitly with a
// documented reason (@Disabled) rather than commenting the test out:

@Test
@Disabled("No MERGE/UPSERT support yet")
override fun testDedupe() {
    super.testDedupe()
}

Add Database-Specific Tests

@Test
fun testDatabaseSpecificFeature() {
    // Your custom test
}

Reference Implementations

Snowflake

File: destination-snowflake/src/test-integration/.../SnowflakeBasicFunctionalityTest.kt

Parameters:

  • unionBehavior = UnionBehavior.PROMOTE_TO_OBJECT (uses VARIANT type)
  • schematizedObjectBehavior = PASS_THROUGH (native OBJECT type)
  • allTypesBehavior.integerCanBeLarge = true (NUMBER unlimited)

ClickHouse

File: destination-clickhouse/src/test-integration/.../ClickhouseBasicFunctionalityTest.kt

Parameters:

  • dedupBehavior = SOFT_DELETE (ReplacingMergeTree doesn't support DELETE in MERGE)
  • schematizedArrayBehavior = STRINGIFY (no native typed arrays)
  • allTypesBehavior.integerCanBeLarge = false (Int64 has limits)

MySQL

File: destination-mysql/src/test-integration/.../MySQLBasicFunctionalityTest.kt

Parameters:

  • unionBehavior = STRINGIFY
  • schematizedObjectBehavior = STRINGIFY (JSON type but limited)
  • commitDataIncrementally = true

Troubleshooting

"No bean of type [DestinationDataDumper]"

Cause: DataDumper not created in companion object

Fix: Verify createDataDumper() returns {DB}DataDumper instance

"Test hangs indefinitely"

Cause: Database not responding or deadlock

Fix:

  • Check database is running (Testcontainers started?)
  • Check for locks (previous test didn't cleanup?)
  • Add timeout: @Timeout(5, unit = TimeUnit.MINUTES) (example below)
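
For example, JUnit 5's @Timeout applied to one of the overridden tests:

import java.util.concurrent.TimeUnit
import org.junit.jupiter.api.Test
import org.junit.jupiter.api.Timeout

@Test
@Timeout(value = 5, unit = TimeUnit.MINUTES)
override fun testAppend() {
    super.testAppend()
}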

"All tests fail with same error"

Cause: Setup/cleanup issue

Fix: Check DestinationCleaner.cleanup() actually drops tables

"Data type test fails for one specific type"

Cause: Type conversion in DataDumper is wrong

Fix: Add logging to see what database returns:

val value = rs.getObject(i)
println("Column $columnName: value=$value, type=${value?.javaClass}")

Success Criteria

BasicFunctionalityIntegrationTest is complete when:

Minimum (Phase 8):

  • testAppend passes

Full Feature Set (Phase 13):

  • testAppend passes
  • testTruncate passes
  • testAppendSchemaEvolution passes
  • testDedupe passes

Production Ready (Phase 15):

  • All tests pass
  • All type tests pass
  • CDC tests pass (if supported)
  • No flaky tests
  • Tests run in <15 minutes

Time Estimates

  • Implement DataDumper: 1-2 hours
  • Implement Cleaner: 30 min
  • Create test class with parameters: 30 min
  • Debug testAppend: 1-2 hours
  • Debug other tests: 2-4 hours
  • Total: 5-9 hours

Tip: Implement tests incrementally:

  1. testAppend first (simplest)
  2. testTruncate next
  3. testAppendSchemaEvolution
  4. testDedupe last (most complex)

Summary

BasicFunctionalityIntegrationTest is the gold standard for connector validation but has significant complexity:

Pros:

  • Comprehensive coverage (50+ scenarios)
  • Validates edge cases
  • Required for production certification
  • Catches type handling bugs

Cons:

  • 15+ required constructor parameters
  • 5-9 hours to implement and debug
  • Complex failure modes
  • Slow test execution

Strategy:

  • Phase 8: Get working connector with ConnectorWiringSuite (fast)
  • Phase 15: Add BasicFunctionalityIntegrationTest (comprehensive)
  • Balance: Quick iteration early, thorough validation later

The v2 guide gets you to a working connector without this complexity, but this guide ensures production readiness!