Datavor's ETL core. Six modes: full, partial, incremental, and with-transforms syncs, plus raw SQL execution and a query-level read tool. Together they cover the full spectrum from "copy this table" to "sync only yesterday's orders, transform timestamps to UTC, skip test rows."
Most ETL tools give you one or two ways to move data. Datavor gives you six, each tuned for a specific situation. Your AI tool picks the right one based on what you say — but it helps to know what they do.
Full sync (sync_table): Small-to-medium tables where every row matters. Reference data (countries, currencies, products). Initial loads before incremental kicks in. Anything under ~1M rows where snapshot semantics beat tracking complexity.
Partial sync (sync_table_partial): You want a slice, not the whole table. Yesterday's orders, customers in a region, the top 10% most active users. Anything expressible as a SQL WHERE clause.
Incremental sync (sync_table_incremental): Large tables that grow over time. Datavor remembers the last sync's max updated_at (or any timestamp/sequence column you pick) and fetches only what has changed since. On log-shaped data this can be orders of magnitude faster than a full sync, because only the changed rows move.
Sync with transforms (sync_table_with_transforms): The data crosses a boundary where it needs to change shape. Anonymize PII before warehouse load. Normalize phone numbers. Convert cents to dollars. Lower-case emails. Combine first and last name into a single name. Anything column-level, applied while syncing.
Raw SQL (execute_query): One-off transformations, schema migrations, custom joins. For when the answer requires more than table-to-table movement. Datavor doesn't try to abstract SQL away; it embraces it in the cases where it's the right tool.
Read (get_table_data): Inspecting data before deciding what to do with it. Sample a few rows. Spot-check after a sync. Pull the most recent log entries. Quick "what does this column actually look like" lookups during a conversation.
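The incremental mode described above is a high-watermark pattern. The following is a minimal sketch of that pattern using in-memory SQLite, not Datavor's actual implementation: the function name `incremental_sync`, the `orders` table, and its columns are all assumptions made for illustration.

```python
import sqlite3

def incremental_sync(src, dst, table, cursor_col, last_seen):
    """Copy rows whose cursor column is greater than last_seen.

    Returns the new watermark (unchanged if nothing was fetched).
    Table and column names are interpolated directly, so they must be trusted.
    """
    rows = src.execute(
        f"SELECT id, status, {cursor_col} FROM {table} "
        f"WHERE {cursor_col} > ? ORDER BY {cursor_col}",
        (last_seen,),
    ).fetchall()
    # Upsert by primary key so updated rows overwrite their earlier copies.
    dst.executemany(
        f"INSERT OR REPLACE INTO {table} (id, status, {cursor_col}) VALUES (?, ?, ?)",
        rows,
    )
    dst.commit()
    return rows[-1][2] if rows else last_seen

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
    return db

src, dst = make_db(), make_db()
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "paid", "2026-05-18 09:00:00"),
    (2, "new", "2026-05-19 10:00:00"),
])

# First run: an empty watermark sorts before every timestamp, so all rows are copied.
watermark = incremental_sync(src, dst, "orders", "updated_at", "")

# Later, one row is updated and one is inserted at the source.
src.execute("UPDATE orders SET status = 'paid', updated_at = '2026-05-20 08:00:00' WHERE id = 2")
src.execute("INSERT INTO orders VALUES (3, 'new', '2026-05-20 09:00:00')")

# Second run: only the two rows newer than the stored watermark are fetched.
watermark2 = incremental_sync(src, dst, "orders", "updated_at", watermark)
```

A strict `>` comparison can miss rows that share the watermark's exact timestamp, so a real cursor column should be monotonic or paired with a tie-breaker such as the primary key.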
One concrete example, top to bottom. Source data is messy — leading whitespace, mixed-case emails, cents-as-integers, raw timestamps. Target needs it clean. Here's the recipe, and what it produces.
    -- 3 rows, raw from production
    id: 4821  email: " Alice@ACME.COM"  amount_cents: 12950  created_at: "2026-05-19 14:23:11"  -- no TZ specified
    id: 4822  email: "BOB@example.org"  amount_cents: 8400   created_at: "2026-05-19 14:23:42"
    id: 4823  email: " carol@test.io "  amount_cents: 22500  created_at: "2026-05-19 14:24:08"

    // Saved by save_recipe
    // Reused via apply_recipe
    {
      "name": "prod_to_warehouse",
      "version": 3,
      "transforms": [
        { "column": "email", "ops": ["trim", "lowercase"] },
        { "column": "amount_cents", "op": "divide", "by": 100, "rename_to": "amount_usd" },
        { "column": "created_at", "op": "to_utc", "source_tz": "America/New_York" }
      ]
    }

    -- After applying recipe
    id: 4821  email: "alice@acme.com"   amount_usd: 129.50  created_at: "2026-05-19 18:23:11Z"
    id: 4822  email: "bob@example.org"  amount_usd: 84.00   created_at: "2026-05-19 18:23:42Z"
    id: 4823  email: "carol@test.io"    amount_usd: 225.00  created_at: "2026-05-19 18:24:08Z"
Use transform_preview to see the target output on sample data before running the sync. No more "let's see what happens" with production data.
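To make the recipe semantics concrete, here is a small Python sketch that interprets the recipe JSON shown above and applies it to one of the sample rows. It is an illustration only: the helper names `apply_op` and `apply_recipe_to_row` are hypothetical, and the way Datavor actually evaluates `trim`, `lowercase`, `divide`, and `to_utc` is assumed from the example output. It uses the standard-library `zoneinfo` module (Python 3.9+), which needs a time zone database to be available on the system.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

RECIPE = {
    "name": "prod_to_warehouse",
    "version": 3,
    "transforms": [
        {"column": "email", "ops": ["trim", "lowercase"]},
        {"column": "amount_cents", "op": "divide", "by": 100, "rename_to": "amount_usd"},
        {"column": "created_at", "op": "to_utc", "source_tz": "America/New_York"},
    ],
}

def apply_op(value, op, spec):
    """Apply one named operation to one value (assumed semantics)."""
    if op == "trim":
        return value.strip()
    if op == "lowercase":
        return value.lower()
    if op == "divide":
        return value / spec["by"]
    if op == "to_utc":
        # Interpret the naive timestamp in the source zone, then convert to UTC.
        local = datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(
            tzinfo=ZoneInfo(spec["source_tz"])
        )
        return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    raise ValueError(f"unknown op: {op}")

def apply_recipe_to_row(row, recipe):
    """Return a transformed copy of row; the input row is left untouched."""
    out = dict(row)
    for spec in recipe["transforms"]:
        column = spec["column"]
        value = out.pop(column)
        # A transform carries either a list under "ops" or a single "op".
        for op in spec.get("ops", [spec.get("op")]):
            value = apply_op(value, op, spec)
        out[spec.get("rename_to", column)] = value
    return out

raw = {
    "id": 4821,
    "email": " Alice@ACME.COM",
    "amount_cents": 12950,
    "created_at": "2026-05-19 14:23:11",
}
clean = apply_recipe_to_row(raw, RECIPE)
```

Running the same function over a handful of sample rows before a sync is the idea behind a preview step: the output can be inspected without touching the target.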
A 100,000-row sync hits a malformed row at position 47,213. What happens? With most ETL tools, the answer is "everything stops." With Datavor, it's "one row fails, the other 99,999 succeed, you see exactly what broke."
RESULT (most ETL tools): The job aborts at row 47,213. The target is left in a partial state. You re-run, hit the same row, and abort again. You manually find and fix the bad row, then re-run the entire 100k-row sync from the beginning.
RESULT (Datavor): The bad row gets quarantined with its error. The sync continues to the end. Final report: {success: 99,999, quarantined: 1, error: "varchar overflow on email column"}. ErrorLearner records the pattern. A re-run skips the row until you fix it or override.
Quarantined rows write to ~/.datavor/quarantine/<job_id>.jsonl with full row content and error context. Your AI can read them, suggest fixes, and re-attempt — all from inside the same conversation.
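The quarantine behavior can be sketched as a per-row try/except that writes failures to a JSONL file instead of aborting. This is a simplified illustration under stated assumptions, not Datavor's code: `sync_with_quarantine`, `load_row`, the 20-character email limit, and the record fields `position`, `row`, and `error` are all hypothetical, and the file goes to a temporary directory rather than ~/.datavor/quarantine/.

```python
import json
import tempfile
from pathlib import Path

def sync_with_quarantine(rows, load_row, quarantine_path):
    """Load rows one at a time; quarantine failures instead of stopping."""
    report = {"success": 0, "quarantined": 0}
    with open(quarantine_path, "w", encoding="utf-8") as qfile:
        for position, row in enumerate(rows):
            try:
                load_row(row)
                report["success"] += 1
            except Exception as exc:
                # Keep the full row and the error so the failure can be replayed later.
                record = {"position": position, "row": row, "error": str(exc)}
                qfile.write(json.dumps(record) + "\n")
                report["quarantined"] += 1
    return report

def load_row(row):
    # Stand-in for a target insert; assumes the target email column is varchar(20).
    if len(row["email"]) > 20:
        raise ValueError("varchar overflow on email column")

rows = [
    {"id": 1, "email": "alice@acme.com"},
    {"id": 2, "email": "x" * 30 + "@example.org"},  # too long for the target column
    {"id": 3, "email": "carol@test.io"},
]

quarantine_path = Path(tempfile.mkdtemp()) / "job_demo.jsonl"
report = sync_with_quarantine(rows, load_row, quarantine_path)
quarantined = [
    json.loads(line)
    for line in quarantine_path.read_text(encoding="utf-8").splitlines()
]
```

One JSON object per line keeps the file appendable and easy to read back row by row, which is what makes a later fix-and-retry pass practical.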
Source and target tables rarely have identical schemas. Different column names, different types, missing columns. Datavor reconciles automatically — and when it can't, it asks instead of guessing.
Datavor reads both schemas via describe_table, matches columns by exact name first, then by name similarity (customer_email → email), applies type coercion where safe (int → bigint), and prompts only for ambiguous cases. Columns it can't map (like a brand-new tier column) are surfaced as a SuggestionEngine recommendation rather than silently dropped.
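The matching order described above (exact name, then name similarity, then safe type coercion, then ask) can be sketched in a few lines. Everything here is an assumption for illustration: the `reconcile` function, the token-overlap similarity rule, and the `SAFE_WIDENING` table are stand-ins, and Datavor's real similarity measure and coercion rules may differ.

```python
# Type changes assumed to be lossless; anything else is treated as ambiguous.
SAFE_WIDENING = {("int", "bigint"), ("float", "double"), ("varchar", "text")}

def tokens(name):
    """Split a snake_case column name into a set of lowercase tokens."""
    return set(name.lower().split("_"))

def reconcile(source, target):
    """source and target map column name -> type name. Returns a mapping plan."""
    plan = {"mapped": {}, "coerce": {}, "ask": {}, "unmapped": []}
    remaining = set(target)

    # Pass 1: exact name matches.
    for col in source:
        if col in remaining:
            plan["mapped"][col] = col
            remaining.discard(col)

    # Pass 2: similarity, here meaning one name's tokens contain the other's.
    for col in source:
        if col in plan["mapped"]:
            continue
        candidates = sorted(
            t for t in remaining
            if tokens(t) <= tokens(col) or tokens(col) <= tokens(t)
        )
        if len(candidates) == 1:
            plan["mapped"][col] = candidates[0]
            remaining.discard(candidates[0])
        elif candidates:
            plan["ask"][col] = candidates   # several plausible targets: ask
        else:
            plan["unmapped"].append(col)    # surface it, never drop it

    # Pass 3: type check on every mapped pair.
    for src_col, dst_col in plan["mapped"].items():
        pair = (source[src_col], target[dst_col])
        if pair[0] != pair[1]:
            if pair in SAFE_WIDENING:
                plan["coerce"][src_col] = pair
            else:
                plan["ask"][src_col] = [dst_col]
    return plan

source = {"id": "int", "customer_email": "varchar", "created_at": "timestamp", "tier": "varchar"}
target = {"id": "bigint", "email": "varchar", "created_at": "timestamp"}
plan = reconcile(source, target)
```

In this example `customer_email` maps to `email`, `id` is widened from int to bigint, and `tier` ends up in `unmapped` so it can be reported instead of disappearing.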
The Sync & Transform tools are exposed through MCP. Your AI tool reads the conversation, picks the right one, fills in the parameters. Full reference in the docs.
| Tool | Purpose |
|---|---|
| sync_table | Full sync: truncate target, reload from source. Idempotent. |
| sync_table_partial | Sync only rows matching a SQL WHERE clause. |
| sync_table_incremental | Sync only new or updated rows, using a cursor column. |
| sync_table_with_transforms | Sync with inline column-level transforms or a named recipe. |
| execute_query | Run raw SQL: SELECT, INSERT, UPDATE, DELETE, DDL. |
| get_table_data | Fetch rows from a table with optional WHERE / ORDER BY / LIMIT. |
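For a sense of what a tool invocation looks like on the wire, the sketch below builds an MCP `tools/call` request as a JSON-RPC 2.0 message. The envelope (`jsonrpc`, `id`, `method`, `params.name`, `params.arguments`) follows the MCP convention; the argument names `source`, `target`, `table`, and `cursor_column` are hypothetical placeholders, since the actual parameters of each tool are defined in the Datavor docs.

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Argument names below are assumptions for illustration, not Datavor's schema.
request = build_tool_call(
    1,
    "sync_table_incremental",
    {
        "source": "prod",
        "target": "warehouse",
        "table": "orders",
        "cursor_column": "updated_at",
    },
)
payload = json.dumps(request)
```

In practice the AI tool assembles this message itself from the conversation, so nobody types it by hand.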
Six things people actually say to Datavor every day, and which sync mode each maps to. None of them require knowing the mode names.
Moving data from one table to another is the easy part. What's hard is doing it reliably, with transforms applied in flight, and from natural language. Datavor's six modes cover the spectrum without forcing you to learn yet another DSL.