Sync & Transform

Pick your sync mode.

Most ETL tools give you one or two ways to move data. Datavor gives you six, each tuned for a specific situation. Your AI tool picks the right one based on what you say — but it helps to know what they do.

⬡ sync_table

Full sync — every row, every time

Full

When to use

Small-to-medium tables where every row matters. Reference data (countries, currencies, products). Initial loads before incremental kicks in. Anything under ~1M rows where snapshot semantics beat tracking complexity.

Truncate target, reload from source
Atomic — target is consistent at end of sync
Idempotent — running twice == running once

In conversation

You: "Copy the products table from prod-pg to analytics-pg, fresh every night."
Claude: Calls scheduler_create_job wrapping sync_table with daily cron.

// Generated tool call:
sync_table(
  source="prod-pg",
  target="analytics-pg",
  table="products",
  mode="full"
)
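To make "truncate, reload, idempotent" concrete, here is a minimal sketch of the same semantics using Python's built-in sqlite3 module. It is an illustration under assumptions, not Datavor's implementation: the full_sync helper, the two-column products schema, and the in-memory databases are all made up for the example.

```python
import sqlite3

def full_sync(source: sqlite3.Connection, target: sqlite3.Connection, table: str) -> int:
    """Replace every row of `table` in target with the rows currently in source.

    `table` is interpolated into SQL, so it must be a trusted identifier.
    The (id, name) column list is an assumption of this example.
    """
    rows = source.execute(f"SELECT id, name FROM {table}").fetchall()
    # `with target` runs the delete and the reload in one transaction:
    # it commits on success and rolls back if anything raises.
    with target:
        target.execute(f"DELETE FROM {table}")  # SQLite has no TRUNCATE
        target.executemany(f"INSERT INTO {table} (id, name) VALUES (?, ?)", rows)
    return len(rows)

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO products VALUES (?, ?)", [(1, "widget"), (2, "gadget")])
target.execute("INSERT INTO products VALUES (99, 'stale row')")

full_sync(source, target, "products")
first = target.execute("SELECT id, name FROM products ORDER BY id").fetchall()
full_sync(source, target, "products")  # running it again changes nothing
second = target.execute("SELECT id, name FROM products ORDER BY id").fetchall()
```

The stale row disappears on the first run, and the second run leaves the target exactly as the first one did. That is the property that makes a nightly full sync safe to retry.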

⬡ sync_table_partial

Filtered sync — only rows matching a WHERE clause

WHERE

When to use

You want a slice, not the whole table. Yesterday's orders, customers in a region, the top 10% of most active users. Anything expressible as SQL WHERE.

SQL WHERE clause defines the slice
Combines with full or upsert modes
Useful for dev/staging environment seeding

In conversation

You: "Copy yesterday's orders from prod to staging."
Claude: Recognizes the time predicate, calls sync_table_partial.

// Generated tool call:
sync_table_partial(
  source="prod-pg",
  target="staging-pg",
  table="orders",
  where="created_at >= CURRENT_DATE - INTERVAL '1 day' AND created_at < CURRENT_DATE"
)
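A "yesterday" slice is usually written as a half-open interval: from midnight yesterday, inclusive, up to midnight today, exclusive. A lower bound alone would also pick up rows created today. The sketch below builds that predicate for a given date; the helper names are hypothetical and the DATE literal syntax assumes a PostgreSQL-style target.

```python
from datetime import date, timedelta

def yesterday_window(today: date) -> tuple[str, str]:
    """Return (start, end) ISO dates covering all of yesterday, end exclusive."""
    start = today - timedelta(days=1)
    return start.isoformat(), today.isoformat()

def yesterday_where(column: str, today: date) -> str:
    """Build a half-open WHERE predicate for yesterday's rows.

    `column` is interpolated into SQL, so it must be a trusted identifier.
    """
    start, end = yesterday_window(today)
    return f"{column} >= DATE '{start}' AND {column} < DATE '{end}'"

predicate = yesterday_where("created_at", date(2026, 5, 20))
```

Pinning the dates at planning time, rather than relying on CURRENT_DATE at run time, also keeps a retried job pointed at the same slice.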

⬡ sync_table_incremental

Incremental sync — only new + updated rows

Cursor-based

When to use

Large tables that grow over time. Datavor remembers the last sync's maximum updated_at (or any timestamp or sequence column you pick) and fetches only what has changed since. For log-shaped data this can be orders of magnitude faster than a full sync, because only the changed rows are read and written.

Cursor column auto-detected when possible
Overlap window for late-arriving updates
Falls back to full sync if cursor is missing

In conversation

You: "Sync the orders table from prod to analytics every 15 minutes."
Claude: Sees high-volume table, picks incremental.

// Generated tool call:
sync_table_incremental(
  source="prod-pg",
  target="analytics-pg",
  table="orders",
  cursor_column="updated_at",
  overlap_minutes=5
)
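The cursor-plus-overlap idea can be sketched in a few lines. This is an in-memory illustration under assumptions, not Datavor's code: rows are plain dicts, the target is a dict keyed by id, and the incremental_sync helper is hypothetical.

```python
from datetime import datetime, timedelta

def incremental_sync(source_rows, target, last_cursor, overlap=timedelta(minutes=5)):
    """Upsert rows changed since the last cursor and return (new_cursor, rows_read).

    source_rows: list of dicts with 'id' and 'updated_at' (datetime)
    target:      dict keyed by id, updated in place
    last_cursor: max updated_at seen by the previous run, or None on the first run
    """
    if last_cursor is None:
        changed = list(source_rows)  # no cursor yet, so behave like a full sync
    else:
        # Re-read a small window before the cursor to catch late-arriving updates.
        threshold = last_cursor - overlap
        changed = [r for r in source_rows if r["updated_at"] > threshold]
    for row in changed:
        target[row["id"]] = dict(row)  # upsert by key, so re-reading a row is harmless
    seen = [r["updated_at"] for r in changed]
    if last_cursor is not None:
        seen.append(last_cursor)  # the cursor never moves backward
    return max(seen, default=None), len(changed)

t0 = datetime(2026, 5, 19, 14, 0)
source = [
    {"id": 1, "status": "paid", "updated_at": t0},
    {"id": 2, "status": "new", "updated_at": t0 + timedelta(minutes=10)},
]
target = {}
cursor, n_first = incremental_sync(source, target, None)
source.append({"id": 3, "status": "new", "updated_at": t0 + timedelta(minutes=20)})
cursor2, n_second = incremental_sync(source, target, cursor)
```

The second run reads two rows, not one: the new row plus row 2, which falls inside the five-minute overlap. Because the write is an upsert, that extra read costs a little time and changes nothing.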

⬡ sync_table_with_transforms

Sync + inline column transforms

Recipes

When to use

The data crosses a boundary where it needs to change shape. Anonymize PII before warehouse load. Normalize phone numbers. Cents-to-dollars. Lower-case emails. Combine first+last into a single name. Anything column-level, while syncing.

Inline transforms — no separate stage
Apply saved recipes by name
Context Engine stores recipes for reuse

In conversation

You: "Sync orders to analytics, but mask the customer_email and convert amount_cents to amount_usd."
Claude: Calls sync_table_with_transforms.

// Generated tool call:
sync_table_with_transforms(
  source="prod-pg",
  target="analytics-pg",
  table="orders",
  transforms=[
    {column: "customer_email", op: "hash_sha256"},
    {column: "amount_cents", op: "divide", by: 100, as: "amount_usd"}
  ]
)
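For a feel of what those two transforms do to a single row, here is a small sketch using only the Python standard library. The function names are hypothetical and the exact behavior of Datavor's hash_sha256 and divide ops is an assumption; the point is the shape of the change.

```python
import hashlib
from decimal import Decimal

def hash_sha256(value: str) -> str:
    """Deterministic one-way mask: the same input always gives the same digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def cents_to_dollars(cents: int) -> Decimal:
    """Divide by 100 using Decimal so money never passes through binary floats."""
    return (Decimal(cents) / 100).quantize(Decimal("0.01"))

def transform_order(row: dict) -> dict:
    """Mask the email and replace amount_cents with amount_usd."""
    out = dict(row)
    out["customer_email"] = hash_sha256(row["customer_email"])
    out["amount_usd"] = cents_to_dollars(out.pop("amount_cents"))
    return out

order = {"id": 4821, "customer_email": "alice@acme.com", "amount_cents": 12950}
masked = transform_order(order)
```

Because the hash is deterministic, the masked column still works as a join key in the warehouse. A plain SHA-256 of an email can be matched against a list of known addresses, so a keyed hash such as HMAC is a common hardening step when that matters.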

⬡ execute_query

Raw SQL — when only SQL will do

SQL

When to use

One-off transformations, schema migrations, custom joins. When the answer requires more than table-to-table movement. Datavor doesn't try to abstract SQL away — it embraces it for the cases where it's the right tool.

SELECT, INSERT, UPDATE, DELETE, DDL
Parameterized — no string concatenation injection risk
Result limits configurable (default 10k rows)

In conversation

You: "Add a tier column to customers, with values from the new tiers table."
Claude: Issues two execute_query calls.

// 1. Add the column
execute_query(
  connection="prod-pg",
  sql="ALTER TABLE customers ADD COLUMN tier varchar(20)"
)

// 2. Backfill it
execute_query(
  connection="prod-pg",
  sql="UPDATE customers c SET tier = t.tier FROM tiers t WHERE c.id = t.customer_id"
)
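The "parameterized" point above is easiest to see side by side. The sketch below uses Python's sqlite3 module and a made-up customers table, so it shows the general technique, not execute_query's actual parameter syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'alice@acme.com')")

def find_customer(conn: sqlite3.Connection, email: str) -> list:
    # The ? placeholder sends `email` to the database as data, never as SQL text.
    return conn.execute(
        "SELECT id, email FROM customers WHERE email = ?", (email,)
    ).fetchall()

hostile = "x' OR '1'='1"

# Parameterized: the hostile string is compared literally and matches nothing.
safe_result = find_customer(conn, hostile)

# String concatenation: the quote closes the literal and the OR matches every row.
unsafe_sql = f"SELECT id, email FROM customers WHERE email = '{hostile}'"
unsafe_result = conn.execute(unsafe_sql).fetchall()
```

The same input returns no rows through the placeholder and the whole table through concatenation.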

⬡ get_table_data

Read-only fetch — peek without copying

Read

When to use

Inspecting data before deciding what to do with it. Sample a few rows. Spot-check after a sync. Most-recent log entries. Quick "what does this column actually look like" lookups during conversation.

Optional WHERE clause and ORDER BY
Default LIMIT 100 to keep tokens manageable
Returns rows as JSON — easy for the AI to reason over

In conversation

You: "Show me the latest 5 orders."
Claude: Calls get_table_data, summarizes inline.

// Generated tool call:
get_table_data(
  connection="prod-pg",
  table="orders",
  order_by="created_at DESC",
  limit=5
)
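A read-only fetch of this kind boils down to a bounded SELECT whose rows come back as JSON objects. Here is a rough sketch with sqlite3; fetch_rows_as_json is a hypothetical stand-in, and the default of 100 simply mirrors the limit described above.

```python
import json
import sqlite3

def fetch_rows_as_json(conn, table, order_by=None, limit=100):
    """Return up to `limit` rows of `table` as a JSON array of objects.

    `table` and `order_by` are interpolated into SQL, so they must be
    trusted identifiers; only the limit is passed as a bound parameter.
    """
    sql = f"SELECT * FROM {table}"
    if order_by:
        sql += f" ORDER BY {order_by}"
    sql += " LIMIT ?"
    cur = conn.execute(sql, (limit,))
    columns = [d[0] for d in cur.description]
    return json.dumps([dict(zip(columns, row)) for row in cur.fetchall()])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, f"2026-05-19 14:2{i}:00") for i in range(1, 8)],
)
latest = json.loads(fetch_rows_as_json(conn, "orders", order_by="created_at DESC", limit=5))
```

Keying each row by column name costs a few extra tokens per row, but it lets the model answer "what does this column look like" without a separate schema lookup.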

A transform recipe, end-to-end.

One concrete example, top to bottom. Source data is messy — leading whitespace, mixed-case emails, cents-as-integers, raw timestamps. Target needs it clean. Here's the recipe, and what it produces.

① SOURCE orders @ prod-pg

-- 3 rows, raw from production
id: 4821
email: "  Alice@ACME.COM"
amount_cents: 12950
created_at: "2026-05-19 14:23:11"
  -- no TZ specified

id: 4822
email: "BOB@example.org"
amount_cents: 8400
created_at: "2026-05-19 14:23:42"

id: 4823
email: "  carol@test.io  "
amount_cents: 22500
created_at: "2026-05-19 14:24:08"

② RECIPE prod_to_warehouse

// Saved by save_recipe
// Reused via apply_recipe
{
  "name": "prod_to_warehouse",
  "version": 3,
  "transforms": [
    {
      "column": "email",
      "ops": ["trim", "lowercase"]
    },
    {
      "column": "amount_cents",
      "op": "divide",
      "by": 100,
      "rename_to": "amount_usd"
    },
    {
      "column": "created_at",
      "op": "to_utc",
      "source_tz": "America/New_York"
    }
  ]
}

③ TARGET orders @ warehouse

-- After applying recipe
id: 4821
email: "alice@acme.com"
amount_usd: 129.50
created_at: "2026-05-19 18:23:11Z"

id: 4822
email: "bob@example.org"
amount_usd: 84.00
created_at: "2026-05-19 18:23:42Z"

id: 4823
email: "carol@test.io"
amount_usd: 225.00
created_at: "2026-05-19 18:24:08Z"

Use transform_preview to see the target output on sample data before running the sync. No more "let's see what happens" with production data.
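To show how a recipe like prod_to_warehouse could turn the source rows into the target rows, here is a minimal interpreter for the four ops it uses. It is a sketch under assumptions: apply_recipe and source_zone are hypothetical helpers, the divide op is assumed to round to two decimals, and the fixed-offset fallback exists only so the example runs on machines without a time zone database.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

RECIPE = {
    "name": "prod_to_warehouse",
    "transforms": [
        {"column": "email", "ops": ["trim", "lowercase"]},
        {"column": "amount_cents", "op": "divide", "by": 100, "rename_to": "amount_usd"},
        {"column": "created_at", "op": "to_utc", "source_tz": "America/New_York"},
    ],
}

def source_zone(name: str):
    try:
        return ZoneInfo(name)
    except ZoneInfoNotFoundError:
        # Demo-only fallback: New York is on EDT (UTC-4) on the May sample dates.
        if name == "America/New_York":
            return timezone(timedelta(hours=-4))
        raise

def apply_op(op: str, value, spec: dict):
    if op == "trim":
        return value.strip()
    if op == "lowercase":
        return value.lower()
    if op == "divide":
        # Assumption: money columns are rounded to two decimal places.
        return (Decimal(value) / spec["by"]).quantize(Decimal("0.01"))
    if op == "to_utc":
        local = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        local = local.replace(tzinfo=source_zone(spec["source_tz"]))
        return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    raise ValueError(f"unknown op: {op}")

def apply_recipe(row: dict, recipe: dict) -> dict:
    out = dict(row)
    for spec in recipe["transforms"]:
        col = spec["column"]
        value = out[col]
        for op in spec.get("ops") or [spec["op"]]:
            value = apply_op(op, value, spec)
        new_name = spec.get("rename_to", col)
        # Rebuild the dict so a renamed column keeps its original position.
        out = {(new_name if k == col else k): (value if k == col else v)
               for k, v in out.items()}
    return out

raw = {"id": 4821, "email": "  Alice@ACME.COM", "amount_cents": 12950,
       "created_at": "2026-05-19 14:23:11"}
clean = apply_recipe(raw, RECIPE)
```

Running the first source row through this sketch gives the same values as the first target row above: a trimmed lower-case email, 129.50 in amount_usd, and 18:23:11Z for a 14:23:11 New York timestamp in May.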

Per-record fault tolerance, actually demonstrated.

A 100,000-row sync hits a malformed row at position 47,213. What happens? With most ETL tools, the answer is "everything stops." With Datavor, it's "one row fails, the other 99,999 succeed, you see exactly what broke."

Typical ETL tool FAILS HARD

RESULT Job aborts at row 47,213. Target left in partial state. You re-run, hit the same row, abort again. You manually find and fix the bad row, then re-run the entire 100k-row sync from the beginning.

Datavor FAILS GRACEFULLY

RESULT Bad row gets quarantined with its error. Sync continues to the end. Final report: {success: 99999, quarantined: 1, error: "varchar overflow on email column"}. ErrorLearner records the pattern. A re-run skips the row until you fix or override it.

Quarantined rows write to ~/.datavor/quarantine/<job_id>.jsonl with full row content and error context. Your AI can read them, suggest fixes, and re-attempt — all from inside the same conversation.
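The quarantine pattern itself is small: catch the failure per row, record the row with its error, and keep going. The sketch below is a generic illustration, not Datavor's code. The sync_with_quarantine helper, the JSON field names, and the 20-character email limit are assumptions, and the quarantine lines stay in memory where a real job would append them to a .jsonl file.

```python
import json

def sync_with_quarantine(rows, write_row):
    """Write each row; on failure, quarantine that row and continue.

    Returns (report, quarantine_lines), where each line is one JSON object,
    the same shape a .jsonl quarantine file would hold.
    """
    report = {"success": 0, "quarantined": 0}
    quarantine_lines = []
    for position, row in enumerate(rows, start=1):
        try:
            write_row(row)
            report["success"] += 1
        except Exception as exc:  # a real loader would catch its driver's error types
            report["quarantined"] += 1
            quarantine_lines.append(
                json.dumps({"position": position, "row": row, "error": str(exc)})
            )
    return report, quarantine_lines

target = []

def write_row(row):
    # Stand-in for an INSERT into a table with a varchar(20) email column.
    if len(row["email"]) > 20:
        raise ValueError("varchar overflow on email column")
    target.append(row)

rows = [{"id": i, "email": f"user{i}@example.org"} for i in range(1, 6)]
rows[2]["email"] = "a-very-long-address-that-will-not-fit@example.org"
report, quarantine = sync_with_quarantine(rows, write_row)
```

Four rows land in the target, one lands in quarantine with its position and error, and nothing has to be re-run from the start.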

Schema-aware column mapping.

Source and target tables rarely have identical schemas. Different column names, different types, missing columns. Datavor reconciles automatically — and when it can't, it asks instead of guessing.

Datavor reads both schemas via describe_table, matches by name first, then by name-similarity (customer_email → email), applies type coercion where safe (int → bigint), and prompts only for ambiguous cases. Columns it can't map (like a brand-new tier) get surfaced as a SuggestionEngine recommendation — not silently dropped.
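The matching order described above (exact name, then name similarity, then safe type coercion, with everything else surfaced) can be sketched as follows. This is a simplified illustration: map_columns is hypothetical, "similarity" is reduced to a whole-token match on underscore-separated names, and the list of safe widenings is an assumption, not Datavor's actual rule set.

```python
# Source-to-target type changes assumed to be lossless for this example.
SAFE_WIDENINGS = {("int", "bigint"), ("smallint", "int"), ("smallint", "bigint")}

def map_columns(source: dict, target: dict):
    """Map target columns to source columns. Both inputs are {column: type}.

    Returns (mapping, coercions, needs_input, unmapped):
      mapping      target column -> source column
      coercions    target column -> (source type, target type), safe widenings only
      needs_input  (target column, candidate source columns) to ask the user about
      unmapped     target columns with no plausible source
    """
    mapping, coercions, needs_input, unmapped = {}, {}, [], []
    remaining = set(source)
    # Pass 1: exact name match.
    for col in target:
        if col in remaining:
            mapping[col] = col
            remaining.discard(col)
    # Pass 2: the target name appears as a whole token of a source name.
    for col in target:
        if col in mapping:
            continue
        candidates = sorted(s for s in remaining if col in s.split("_"))
        if len(candidates) == 1:
            mapping[col] = candidates[0]
            remaining.discard(candidates[0])
        elif candidates:
            needs_input.append((col, candidates))  # ambiguous: ask, do not guess
        else:
            unmapped.append(col)
    # Pass 3: accept a type difference only when it is a known safe widening.
    for tgt, src in mapping.items():
        pair = (source[src], target[tgt])
        if pair[0] != pair[1]:
            if pair in SAFE_WIDENINGS:
                coercions[tgt] = pair
            else:
                needs_input.append((tgt, [src]))
    return mapping, coercions, needs_input, unmapped

source = {"id": "int", "customer_email": "varchar", "amount_cents": "int"}
target = {"id": "bigint", "email": "varchar", "amount_cents": "int", "tier": "varchar"}
mapping, coercions, needs_input, unmapped = map_columns(source, target)
```

In this example customer_email maps to email, id is widened from int to bigint, and tier comes back as unmapped so it can be surfaced as a suggestion.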

The 6 MCP tools.

The Sync & Transform tools are exposed through MCP. Your AI tool reads the conversation, picks the right one, fills in the parameters. Full reference in the docs.

Tool	Purpose
`sync_table`	Full sync — truncate target, reload from source. Idempotent.
`sync_table_partial`	Sync only rows matching a SQL WHERE clause.
`sync_table_incremental`	Sync only new or updated rows, using a cursor column.
`sync_table_with_transforms`	Sync with inline column-level transforms or a named recipe.
`execute_query`	Run raw SQL — SELECT, INSERT, UPDATE, DELETE, DDL.
`get_table_data`	Fetch rows from a table with optional WHERE / ORDER BY / LIMIT.

Real conversations, real syncs.

Six things people actually say to Datavor every day, and which sync mode each maps to. None of them require knowing the mode names.

Nightly warehouse load

Sync the orders, customers, and products tables from prod to the warehouse every night at 2am. Skip test rows.

uses: sync_table_incremental · scheduler_create_job · add_rule

Staging seed

Copy yesterday's orders to staging, but anonymize the customer emails.

uses: sync_table_partial · sync_table_with_transforms

One-off backfill

Backfill the new tier column on customers from the subscriptions table for everyone signed up before March.

uses: execute_query (UPDATE with JOIN)

Quick lookup mid-conversation

What did the last 10 failed orders look like? Show me their statuses.

uses: get_table_data with WHERE + ORDER BY

Reference data refresh

Update the products table fresh from prod into staging every Monday morning.

uses: sync_table · scheduler_create_job

Cross-cloud migration

Move the events table from our old Cloud SQL to the new Snowflake warehouse, with timestamps converted to UTC.

uses: sync_table_with_transforms · save_recipe

Sync. Transform.
In any direction.

Sync is the easy part.
