Smart Joins: Automatically Connect Your Data
AUM's Smart Join engine automatically detects relationships between multiple datasets and suggests optimal join strategies.
Overview
Unlike traditional tools that require you to manually specify join keys, AUM uses fuzzy matching and value overlap analysis to find connections automatically.
How Smart Joins Work
Detection Methods
Exact Name Matching
- Identifies columns with identical names across datasets
- Example:
customer_idin both Orders and Customers tables
Fuzzy Name Matching
- Finds similar column names using string similarity
- Example:
cust_id↔customer_id(68% similarity match)
Value Overlap Analysis
- Compares actual data values between columns
- Calculates overlap percentage
- Example: 85% of
order_customer_idvalues exist incustomers.id
Pattern Recognition
- Identifies ID-like patterns (e.g., CUST-0001, ORD-12345)
- Detects hierarchical relationships
Cardinality Analysis
- Determines relationship type (1:1, 1:N, N:M)
- Suggests appropriate join type
Join Suggestions
When you upload multiple files, AUM displays suggested joins:
🔗 Suggested Join #1 (Confidence: 92%)
Left: orders.csv → customer_id
Right: customers.csv → id
Type: LEFT JOIN (recommended)
Expected Rows: 1,247 (no row loss)
Overlap: 98% of orders have matching customers
Confidence Scoring
- High (>75%): Strong relationship detected, safe to join
- Medium (50-75%): Likely match, review before joining
- Low (<50%): Weak relationship, manual verification needed
Performing Joins
UI Workflow
- Upload Multiple Files: Drag and drop 2-5 files
- Review Suggestions: Click "Smart Joins" panel
- Preview Join: Click "Preview" to see sample output
- Execute: Click "Apply Join" to create merged dataset
Join Types Supported
Inner Join
- Returns only matching rows from both tables
- Use When: You only want records with matches
Left Join
- Returns all rows from left table + matches from right
- Use When: Main table is primary, enriching with secondary data
Right Join
- Returns all rows from right table + matches from left
- Use When: Opposite of left join
Full Outer Join
- Returns all rows from both tables
- Use When: Want to see all data, match or not
Advanced Join Features
Multi-Column Joins
Join on composite keys:
Join Condition:
orders.region = stores.region AND
orders.store_id = stores.id
Sequential Joins (3+ Files)
AUM can chain joins:
- Join Orders + Customers
- Join Result + Products
- Final merged dataset
Join Warnings
AUM warns you about potential issues:
⚠️ Warning: Only 45% overlap detected 💡 Suggestion: Consider left join to avoid data loss ⚠️ Warning: Join would create 3x row explosion (many-to-many)
Join Quality Metrics
After joining, AUM reports:
- Matched Rows: 1,245 out of 1,250 (99.6%)
- Null Values Introduced: 15 rows missing right table data
- Row Expansion: 1x (no duplication)
- Columns Added: 8 new columns from right table
Manual Join Builder
For complex scenarios, use Manual Join Builder:
- Click "Advanced Joins" tab
- Select left and right datasets
- Choose join columns manually
- Pick join type
- Preview and execute
API: Programmatic Joins
// Step 1: Get join suggestions
const suggestions = await fetch('/api/joins/join-suggestions', {
method: 'POST',
body: formData // containing multiple files
});
// Step 2: Preview a suggested join
const preview = await fetch('/api/joins/join-preview', {
method: 'POST',
body: JSON.stringify({
left_file: 'orders.csv',
right_file: 'customers.csv',
left_column: 'customer_id',
right_column: 'id',
join_type: 'left'
})
});
// Step 3: Execute join
const result = await fetch('/api/joins/join-complete', {
method: 'POST',
body: joinParams
});
Best Practices
✅ DO:
- Use consistent ID formats across files
- Name join keys clearly (e.g.,
customer_id, notid1) - Preview joins before executing
- Check for null values after joining
❌ DON'T:
- Join on non-unique columns (causes row explosion)
- Ignore low confidence warnings
- Assume all IDs will match
Use Cases
E-commerce: Orders + Products + Customers
orders.csv (order_id, customer_id, product_id, amount)
↓ JOIN
customers.csv (id, name, email, segment)
↓ JOIN
products.csv (id, name, category, price)
↓ RESULT
Enriched dataset with customer names, product details, segments
Manufacturing: Production + Machines + Operators
production_log.csv (batch_id, machine_id, operator_id, output)
↓ JOIN
machines.csv (id, name, line, capacity)
↓ JOIN
operators.csv (id, name, shift, experience)
↓ RESULT
Full production context with machine and operator details
Troubleshooting
Q: No join suggestions appeared A: Ensure your files have at least one overlapping column or ID-like field
Q: Join created too many rows A: You likely have a many-to-many relationship. Check join key uniqueness.
Q: Many null values after join A: Low overlap between datasets. Consider using full outer join to see unmatched rows.
Q: Can I undo a join? A: Yes, original datasets remain unchanged. Simply delete the joined dataset.
Performance Tips
- Join smaller tables first, then larger ones
- Limit preview to 100 rows for speed
- For 5+ files, use sequential join strategy
Next: Master Collaboration Features