Smart Joins: Automatically Connect Your Data

AUM's Smart Join engine automatically detects relationships between multiple datasets and suggests optimal join strategies.

Overview

Unlike traditional tools that require you to manually specify join keys, AUM uses fuzzy matching and value overlap analysis to find connections automatically.

How Smart Joins Work

Detection Methods

Exact Name Matching
- Identifies columns with identical names across datasets
- Example: customer_id in both Orders and Customers tables
Fuzzy Name Matching
- Finds similar column names using string similarity
- Example: cust_id ↔ customer_id (68% similarity match)
Value Overlap Analysis
- Compares actual data values between columns
- Calculates overlap percentage
- Example: 85% of order_customer_id values exist in customers.id
Pattern Recognition
- Identifies ID-like patterns (e.g., CUST-0001, ORD-12345)
- Detects hierarchical relationships
Cardinality Analysis
- Determines relationship type (1:1, 1:N, N:M)
- Suggests appropriate join type

Join Suggestions

When you upload multiple files, AUM displays suggested joins:

🔗 Suggested Join #1 (Confidence: 92%)
Left: orders.csv → customer_id
Right: customers.csv → id
Type: LEFT JOIN (recommended)
Expected Rows: 1,247 (no row loss)
Overlap: 98% of orders have matching customers

Confidence Scoring

High (>75%): Strong relationship detected, safe to join
Medium (50-75%): Likely match, review before joining
Low (<50%): Weak relationship, manual verification needed

Performing Joins

UI Workflow

Upload Multiple Files: Drag and drop 2-5 files
Review Suggestions: Click "Smart Joins" panel
Preview Join: Click "Preview" to see sample output
Execute: Click "Apply Join" to create merged dataset

Join Types Supported

Inner Join

Returns only matching rows from both tables
Use When: You only want records with matches

Left Join

Returns all rows from left table + matches from right
Use When: Main table is primary, enriching with secondary data

Right Join

Returns all rows from right table + matches from left
Use When: Opposite of left join

Full Outer Join

Returns all rows from both tables
Use When: Want to see all data, match or not

Advanced Join Features

Multi-Column Joins

Join on composite keys:

Join Condition:
  orders.region = stores.region AND
  orders.store_id = stores.id

Sequential Joins (3+ Files)

AUM can chain joins:

Join Orders + Customers
Join Result + Products
Final merged dataset

Join Warnings

AUM warns you about potential issues:

⚠️ Warning: Only 45% overlap detected 💡 Suggestion: Consider left join to avoid data loss ⚠️ Warning: Join would create 3x row explosion (many-to-many)

Join Quality Metrics

After joining, AUM reports:

Matched Rows: 1,245 out of 1,250 (99.6%)
Null Values Introduced: 15 rows missing right table data
Row Expansion: 1x (no duplication)
Columns Added: 8 new columns from right table

Manual Join Builder

For complex scenarios, use Manual Join Builder:

Click "Advanced Joins" tab
Select left and right datasets
Choose join columns manually
Pick join type
Preview and execute

API: Programmatic Joins

// Step 1: Get join suggestions
const suggestions = await fetch('/api/joins/join-suggestions', {
  method: 'POST',
  body: formData // containing multiple files
});

// Step 2: Preview a suggested join
const preview = await fetch('/api/joins/join-preview', {
  method: 'POST',
  body: JSON.stringify({
    left_file: 'orders.csv',
    right_file: 'customers.csv',
    left_column: 'customer_id',
    right_column: 'id',
    join_type: 'left'
  })
});

// Step 3: Execute join
const result = await fetch('/api/joins/join-complete', {
  method: 'POST',
  body: joinParams
});

Best Practices

✅ DO:

Use consistent ID formats across files
Name join keys clearly (e.g., customer_id, not id1)
Preview joins before executing
Check for null values after joining

❌ DON'T:

Join on non-unique columns (causes row explosion)
Ignore low confidence warnings
Assume all IDs will match

Use Cases

E-commerce: Orders + Products + Customers

orders.csv (order_id, customer_id, product_id, amount)
    ↓ JOIN
customers.csv (id, name, email, segment)
    ↓ JOIN
products.csv (id, name, category, price)
    ↓ RESULT
Enriched dataset with customer names, product details, segments

Manufacturing: Production + Machines + Operators

production_log.csv (batch_id, machine_id, operator_id, output)
    ↓ JOIN
machines.csv (id, name, line, capacity)
    ↓ JOIN
operators.csv (id, name, shift, experience)
    ↓ RESULT
Full production context with machine and operator details

Troubleshooting

Q: No join suggestions appeared A: Ensure your files have at least one overlapping column or ID-like field

Q: Join created too many rows A: You likely have a many-to-many relationship. Check join key uniqueness.

Q: Many null values after join A: Low overlap between datasets. Consider using full outer join to see unmatched rows.

Q: Can I undo a join? A: Yes, original datasets remain unchanged. Simply delete the joined dataset.

Performance Tips

Join smaller tables first, then larger ones
Limit preview to 100 rows for speed
For 5+ files, use sequential join strategy

Next: Master Collaboration Features