Multi-Source Data Join in the Browser: Skip the Python Pipeline

13 min read · Tags: multi source data join, browser sql, duckdb, data analysis, no-code analytics, csv join api

Your analyst needs to answer one question: which products from our catalog are currently out of stock at the suppliers we ordered from last quarter, and how much revenue are we at risk of missing?

Answering this question requires three data sources: your product catalog (a CSV), your supplier orders (a database export), and live supplier inventory (an API). In a normal data stack, joining these together means writing a Python script, setting up a virtual environment, loading three sources into Pandas dataframes, writing merge logic for each join, handling dtype mismatches, and outputting a result that lives in your terminal session until you export it to yet another file.

That's a two-hour task for someone comfortable with Python. For a business analyst who primarily works in SQL and spreadsheets, it might take a day — or they might escalate it to engineering.

In Harbinger Explorer, it's one SQL query in the browser.


Try it yourself: Start exploring for free. No credit card. 8 demo data sources ready to query.


Why Multi-Source Data Joins Are So Painful Today

The Heterogeneous Data Problem

Modern business data doesn't live in one place. It never has and it never will. A typical analysis involves:

  • Files: CSVs from exports, Excel sheets from partners, JSON dumps from legacy systems
  • APIs: REST endpoints from SaaS tools, CRM systems, financial data providers, internal microservices
  • Databases: PostgreSQL exports, BigQuery tables, SQLite dumps, Parquet files

Each of these sources has its own format, its own authentication, its own schema conventions. Joining them requires a layer that can speak all these languages simultaneously.

In the traditional stack, that layer is Python + Pandas (or Spark for larger datasets). The problem isn't that Pandas is bad — it's excellent — it's that every new join requires a new script, a new environment, and a new round of debugging dtype mismatches and missing values.

The "I'll Just VLOOKUP It" Trap

Non-technical analysts facing a multi-source join often reach for Excel or Google Sheets. They export everything to CSV, open it in Sheets, and start building VLOOKUP chains. This works for small datasets (a few thousand rows) but breaks down quickly:

  • VLOOKUP is one-directional and returns only the first match
  • Many-to-many joins are impossible without helper columns
  • Sheets with 100k+ rows become painfully slow
  • The resulting formulas are impossible to audit or reproduce

The VLOOKUP trap locks analyses in spreadsheets that nobody wants to maintain and nobody trusts to be correct.
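To make the contrast concrete, here is a minimal sketch of what a VLOOKUP chain actually is in SQL terms: a LEFT JOIN. It uses Python's built-in sqlite3 module purely as a stand-in engine (not Harbinger Explorer's DuckDB), and the table and column names (`orders`, `prices`) are invented for illustration. Note how the join surfaces every matching price row, where VLOOKUP would silently return only the first.

```python
import sqlite3

# Stand-in engine: SQLite via Python's standard library.
# Tables and data are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (sku TEXT, qty INTEGER);
    CREATE TABLE prices (sku TEXT, unit_price REAL);
    INSERT INTO orders VALUES ('A-1', 10), ('B-2', 5);
    -- Two price rows for A-1: VLOOKUP would return only the first match.
    INSERT INTO prices VALUES ('A-1', 2.50), ('A-1', 2.75), ('B-2', 4.00);
""")
rows = con.execute("""
    SELECT o.sku, o.qty, p.unit_price
    FROM orders o
    LEFT JOIN prices p ON o.sku = p.sku
    ORDER BY o.sku, p.unit_price
""").fetchall()
for r in rows:
    print(r)
```

The LEFT JOIN keeps all order rows and exposes the duplicate price as two auditable result rows instead of hiding it in a formula.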

Why Python Is Necessary — But Shouldn't Be Required

Python with Pandas is the right tool for complex, large-scale multi-source joins. But it shouldn't be the only tool. When an analyst needs to answer a simple business question that happens to involve two data sources, spinning up a Python environment and writing a 50-line script is massive friction.

Consider the time budget:

  • 5 minutes to create and activate a virtual environment
  • 10 minutes to install dependencies
  • 15 minutes to load and inspect each data source
  • 20 minutes to write and debug the merge logic
  • 10 minutes to handle dtype and null issues
  • 15 minutes to validate the output

That's over an hour of setup and boilerplate before you've answered the business question. The question itself takes five minutes to answer once the data is joined correctly.

The SQL Advantage

SQL was designed for joins. The entire relational model is built around the idea of combining data from multiple tables on shared keys. A database with proper indexing can join millions of rows in milliseconds. And SQL is the lingua franca of data analysis — far more analysts know SQL than Python.

The problem isn't SQL. The problem is getting your heterogeneous data into a SQL engine without building a whole data warehouse first.

What Multi-Source Join Tools Currently Exist

Apache Spark / Databricks

For enterprise-scale joins across multiple large sources, Spark is the right tool. You can load data from S3, Delta Lake, APIs, and more, and join across them with Spark SQL. The results are production-grade.

The setup overhead is significant: you need a Spark cluster, cluster configuration, a data engineering team to maintain it, and significant infrastructure costs. This is not a solution for ad-hoc analysis by a business analyst.

dbt (data build tool)

dbt is excellent for building structured, version-controlled data transformation pipelines. You define models in SQL, chain them together, and run them against your data warehouse. It's the right approach for recurring, governed transformations.

It's not the right approach for ad-hoc multi-source joins: you need a running data warehouse, dbt installed and configured, and the ability to write dbt models. A one-off join across a CSV and an API is overkill for dbt.

Google Sheets + ImportJSON / ImportRange

Power users in Google Sheets use custom functions to pull in API data and combine it with sheet data. This works for simple cases but has severe limitations: rate limits, row limits, formula complexity, and no proper JOIN semantics. The result is often brittle, slow, and impossible to maintain.

Python + Pandas in Jupyter

The most common approach for analysts with Python skills: flexible, powerful, and backed by a rich library ecosystem. But it requires Python knowledge and environment management, and it produces results that aren't shareable without re-running the notebook.

The Better Approach: SQL JOINs Across Any Data Source in Your Browser

Harbinger Explorer provides a SQL query engine (powered by DuckDB) that treats every registered data source as a queryable table. Files, APIs, crawled web data — all accessible in the same SQL query, with full JOIN support, in your browser.

No Python. No ETL pipeline. No database cluster. Register your sources, write SQL, get answers.

How Multi-Source Joins Work in Harbinger Explorer

Step 1: Register All Your Sources

Add each data source to Harbinger Explorer:

  • Upload CSV and Excel files directly
  • Paste API endpoint URLs (with authentication headers if needed)
  • Use the AI Crawler to fetch and parse structured web data

Each source is ingested, its schema is inferred, and it becomes available as a named table.

Step 2: Write a JOIN Query

In the Harbinger Explorer query editor, write standard SQL JOINs across your sources. The business question from the opening — products at risk due to supplier stockouts — becomes:

SELECT
  pc.product_id,
  pc.product_name,
  pc.category,
  pc.price,
  so.quantity_ordered,
  so.order_date,
  si.current_stock AS supplier_current_stock,
  pc.price * so.quantity_ordered AS revenue_at_risk
FROM product_catalog pc
JOIN supplier_orders so
  ON pc.product_id = so.product_id
  AND so.order_date >= CURRENT_DATE - INTERVAL '90 days'
JOIN supplier_inventory_api si
  ON so.supplier_id = si.supplier_id
  AND pc.product_id = si.product_id
WHERE si.current_stock = 0
ORDER BY revenue_at_risk DESC

Here product_catalog is an uploaded CSV, supplier_orders is a database export file, and supplier_inventory_api is a live API source. The JOIN works across all three seamlessly.
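The same three-way join can be sketched in miniature. This example uses Python's built-in sqlite3 module as a stand-in engine (the article's queries run on DuckDB), drops the date filter for brevity, and all data is invented for illustration.

```python
import sqlite3

# Miniature of the three-source join: catalog x orders x inventory.
# SQLite stands in for the real engine; data is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product_catalog (product_id TEXT, product_name TEXT, price REAL);
    CREATE TABLE supplier_orders (product_id TEXT, supplier_id TEXT, quantity_ordered INTEGER);
    CREATE TABLE supplier_inventory (supplier_id TEXT, product_id TEXT, current_stock INTEGER);
    INSERT INTO product_catalog VALUES ('P1', 'Widget', 20.0), ('P2', 'Gadget', 35.0);
    INSERT INTO supplier_orders VALUES ('P1', 'S1', 100), ('P2', 'S1', 40);
    INSERT INTO supplier_inventory VALUES ('S1', 'P1', 0), ('S1', 'P2', 12);
""")
at_risk = con.execute("""
    SELECT pc.product_id,
           pc.price * so.quantity_ordered AS revenue_at_risk
    FROM product_catalog pc
    JOIN supplier_orders so ON pc.product_id = so.product_id
    JOIN supplier_inventory si
      ON so.supplier_id = si.supplier_id AND pc.product_id = si.product_id
    WHERE si.current_stock = 0
    ORDER BY revenue_at_risk DESC
""").fetchall()
print(at_risk)  # only P1 is out of stock: 20.0 * 100 = 2000.0 at risk
```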

Step 3: Run and Analyze

The query runs in DuckDB, one of the fastest analytical SQL engines available, and returns results in seconds. You can immediately apply further SQL filters, aggregations, or window functions:

WITH at_risk AS (
  -- the query above
),
by_category AS (
  SELECT
    category,
    COUNT(*) AS products_at_risk,
    SUM(revenue_at_risk) AS total_revenue_at_risk
  FROM at_risk
  GROUP BY category
)
SELECT
  category,
  products_at_risk,
  total_revenue_at_risk,
  ROUND(total_revenue_at_risk * 100.0 / SUM(total_revenue_at_risk) OVER (), 1) AS pct_of_total
FROM by_category
ORDER BY total_revenue_at_risk DESC

Step 4: Save and Share

Save the query with a descriptive name. Anyone on your team with access to Harbinger Explorer can run it and get the same results. No "send me your Python script." No "which version of the notebook did you use?" The query is the single source of truth.


Pricing: Starter at €8/month (25 chats/day, 10 crawls/month) or Pro at €24/month (200 chats/day, 100 crawls/month, recrawling, priority support). See pricing →

Free 7-day trial, no credit card required. Start free →


Advanced Multi-Source Join Patterns

FULL OUTER JOINs for Data Reconciliation

When reconciling two sources — checking what's in one but not the other — use FULL OUTER JOIN:

SELECT
  COALESCE(a.customer_id, b.customer_id) AS customer_id,
  a.customer_name AS name_in_crm,
  b.customer_name AS name_in_billing,
  CASE
    WHEN a.customer_id IS NULL THEN 'Only in Billing'
    WHEN b.customer_id IS NULL THEN 'Only in CRM'
    ELSE 'In Both'
  END AS reconciliation_status
FROM crm_customers a
FULL OUTER JOIN billing_customers b ON a.customer_id = b.customer_id
WHERE a.customer_id IS NULL OR b.customer_id IS NULL
ORDER BY reconciliation_status

This reconciliation — finding discrepancies between two systems — is one of the most common multi-source join use cases and normally requires Python or dedicated ETL tooling.
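If your engine lacks FULL OUTER JOIN (SQLite only gained it in 3.39, for example), the same reconciliation can be emulated portably: a LEFT JOIN anti-join in each direction, unioned together. A sketch using Python's built-in sqlite3 module, with invented data:

```python
import sqlite3

# Portable FULL OUTER JOIN emulation for reconciliation:
# rows only in CRM, unioned with rows only in Billing.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE crm_customers (customer_id TEXT, customer_name TEXT);
    CREATE TABLE billing_customers (customer_id TEXT, customer_name TEXT);
    INSERT INTO crm_customers VALUES ('C1', 'Acme'), ('C2', 'Globex');
    INSERT INTO billing_customers VALUES ('C2', 'Globex Inc'), ('C3', 'Initech');
""")
discrepancies = con.execute("""
    SELECT a.customer_id, 'Only in CRM' AS reconciliation_status
    FROM crm_customers a
    LEFT JOIN billing_customers b ON a.customer_id = b.customer_id
    WHERE b.customer_id IS NULL
    UNION ALL
    SELECT b.customer_id, 'Only in Billing'
    FROM billing_customers b
    LEFT JOIN crm_customers a ON a.customer_id = b.customer_id
    WHERE a.customer_id IS NULL
""").fetchall()
print(sorted(discrepancies))
```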

Window Functions Across Joined Sources

Once your data is joined, window functions let you add context without collapsing rows:

SELECT
  s.sale_id,
  s.amount,
  c.customer_tier,
  c.country,
  AVG(s.amount) OVER (PARTITION BY c.country) AS avg_sale_by_country,
  s.amount - AVG(s.amount) OVER (PARTITION BY c.customer_tier) AS vs_tier_average,
  RANK() OVER (PARTITION BY c.country ORDER BY s.amount DESC) AS rank_in_country
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
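A runnable miniature of this pattern, using Python's built-in sqlite3 module as a stand-in engine (window functions require SQLite 3.25+); the data is invented for illustration:

```python
import sqlite3

# Window functions over a joined result: per-country average and rank,
# without collapsing the individual sale rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (sale_id INTEGER, customer_id TEXT, amount REAL);
    CREATE TABLE customers (customer_id TEXT, country TEXT);
    INSERT INTO sales VALUES (1, 'C1', 100), (2, 'C2', 300), (3, 'C3', 50);
    INSERT INTO customers VALUES ('C1', 'DE'), ('C2', 'DE'), ('C3', 'FR');
""")
rows = con.execute("""
    SELECT s.sale_id,
           s.amount,
           AVG(s.amount) OVER (PARTITION BY c.country) AS avg_sale_by_country,
           RANK() OVER (PARTITION BY c.country ORDER BY s.amount DESC) AS rank_in_country
    FROM sales s
    JOIN customers c ON s.customer_id = c.customer_id
    ORDER BY s.sale_id
""").fetchall()
for r in rows:
    print(r)
```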

Column Mapping for Cross-Source Key Alignment

Real-world data sources often use different names for the same concept: customer_id, cust_id, CustomerID, id. Harbinger Explorer's Column Mapping feature lets you define canonical names and map variants to them across sources, so your JOIN keys align without manually pre-processing every file.
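Without a mapping layer, the same alignment is done by hand in the JOIN clause: join the variant names directly and alias them onto one canonical column in the output. A sketch in Python's built-in sqlite3 module, with invented table names (`crm`, `billing`) and the `customer_id` / `cust_id` variants from above:

```python
import sqlite3

# Manual cross-source key alignment: crm uses customer_id,
# billing uses cust_id; the JOIN clause bridges the two.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE crm (customer_id TEXT, tier TEXT);
    CREATE TABLE billing (cust_id TEXT, balance REAL);
    INSERT INTO crm VALUES ('C1', 'gold');
    INSERT INTO billing VALUES ('C1', 99.0);
""")
rows = con.execute("""
    SELECT a.customer_id,  -- canonical name in the output
           a.tier,
           b.balance
    FROM crm a
    JOIN billing b ON a.customer_id = b.cust_id
""").fetchall()
print(rows)
```

This works for one query; a mapping feature saves you from repeating the bridge in every query that touches those sources.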

Joining API Data with Historical Files

A particularly powerful pattern: join live API data against historical archives. For example, join current market prices from an API against a historical CSV of your portfolio positions:

SELECT
  p.ticker,
  p.shares_held,
  p.average_cost,
  mp.current_price,
  p.shares_held * mp.current_price AS current_value,
  p.shares_held * (mp.current_price - p.average_cost) AS unrealized_pnl,
  ROUND((mp.current_price - p.average_cost) / p.average_cost * 100, 2) AS pct_return
FROM portfolio_csv p
JOIN market_prices_api mp ON p.ticker = mp.symbol
ORDER BY unrealized_pnl DESC

This join — historical file plus live API — produces a real-time portfolio P&L view without any infrastructure beyond Harbinger Explorer.

Common Mistakes with Multi-Source Joins

Mistake 1: Joining on columns with different data types

If customer_id is stored as an integer in one source and a string in another, the join can silently return zero rows. Always check types before joining:

SELECT TYPEOF(customer_id) AS type, COUNT(*) FROM source_a GROUP BY 1
UNION ALL
SELECT TYPEOF(customer_id) AS type, COUNT(*) FROM source_b GROUP BY 1

Cast explicitly if needed: CAST(a.customer_id AS VARCHAR) = b.customer_id
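The failure mode can be reproduced in a few lines. This sketch uses Python's built-in sqlite3 module with columns declared without types, so SQLite applies no implicit conversion, mimicking a strict engine; other engines may instead raise a cast error, but the explicit CAST fix applies either way. Names and data are invented:

```python
import sqlite3

# Integer key on one side, string key on the other: the naive join
# matches nothing; an explicit CAST aligns the types.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_a (customer_id, name);
    CREATE TABLE source_b (customer_id, amount);
    INSERT INTO source_a VALUES (1, 'Acme');
    INSERT INTO source_b VALUES ('1', 250);
""")
naive = con.execute("""
    SELECT COUNT(*) FROM source_a a
    JOIN source_b b ON a.customer_id = b.customer_id
""").fetchone()[0]
fixed = con.execute("""
    SELECT COUNT(*) FROM source_a a
    JOIN source_b b ON CAST(a.customer_id AS TEXT) = b.customer_id
""").fetchone()[0]
print(naive, fixed)
```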

Mistake 2: Forgetting DISTINCT when joining one-to-many

If source B has multiple rows per customer, an INNER JOIN with source A multiplies rows. Always check your row count before and after a join:

SELECT COUNT(*) FROM source_a;  -- expect: N rows
SELECT COUNT(*) FROM source_a JOIN source_b ON ...;  -- if much larger than N, you have a fan-out join
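One common fix is to collapse the many-side to one row per key in a subquery before joining. A sketch in Python's built-in sqlite3 module, with invented data, showing the fan-out and the aggregated fix side by side:

```python
import sqlite3

# Fan-out join vs. pre-aggregated join: aggregating source_b to one
# row per customer_id restores the expected row count.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_a (customer_id TEXT);
    CREATE TABLE source_b (customer_id TEXT, amount REAL);
    INSERT INTO source_a VALUES ('C1'), ('C2');
    INSERT INTO source_b VALUES ('C1', 10), ('C1', 20), ('C2', 5);
""")
fanned_out = con.execute("""
    SELECT COUNT(*) FROM source_a a JOIN source_b b USING (customer_id)
""").fetchone()[0]
deduped = con.execute("""
    SELECT COUNT(*)
    FROM source_a a
    JOIN (SELECT customer_id, SUM(amount) AS total
          FROM source_b GROUP BY customer_id) b USING (customer_id)
""").fetchone()[0]
print(fanned_out, deduped)
```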

Mistake 3: Using INNER JOIN when you want LEFT JOIN

INNER JOIN drops rows from the left table that don't match the right. If you want to keep all rows from the left table (with nulls for missing matches), use LEFT JOIN:

SELECT
  a.*,
  b.extra_field
FROM main_table a
LEFT JOIN supplementary_data b ON a.id = b.id
-- rows with no match in b will have null for b.extra_field

Mistake 4: Not validating join key uniqueness

Before joining, verify that your join key is actually unique in the table you expect it to be unique in:

SELECT customer_id, COUNT(*) AS occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1
-- If this returns rows, your join key is not unique

Feature Comparison

| Capability | Python + Pandas | dbt + Warehouse | Harbinger Explorer |
| --- | --- | --- | --- |
| Setup time for one-off join | 30-60 minutes | Hours/days | Under 5 minutes |
| SQL interface | ❌ (Python API) | ✅ | ✅ |
| Join CSV + API + file | Complex | ❌ | ✅ Native |
| Browser-based, no install | ❌ | ❌ | ✅ |
| Shareable saved queries | ❌ | ✅ | ✅ |
| PII governance | Manual | ✅ (with tooling) | ✅ Built-in |

FAQ

How many sources can I join in one query? DuckDB (the engine powering Harbinger Explorer) supports JOINs across as many tables as your query requires. In practice, most analyses join 2-5 sources. There's no artificial limit.

Does it work with large files? Harbinger Explorer handles files up to several hundred MB efficiently. DuckDB is designed for analytical workloads and performs well on large datasets without requiring dedicated infrastructure.

Can non-technical users write JOIN queries? Harbinger Explorer includes an AI assistant that can generate SQL from natural language. Describe your join in plain English ("show me products from the catalog that have zero stock in the supplier inventory") and the AI writes the SQL.

What file formats are supported? CSV, JSON, Excel (XLSX), Parquet, and more. API sources support standard REST JSON responses.

Real-World Case Study: Procurement Team Joins Three Sources to Find Savings

A procurement team at a manufacturing company needed to identify cost-saving opportunities across their supplier relationships. The data they needed lived in three places:

  1. Supplier master data: A CSV exported from their ERP system with supplier names, tiers, payment terms, and category codes
  2. Purchase order history: A database export (CSV) with all POs from the last two years, including line items, quantities, and negotiated prices
  3. Benchmark pricing API: A third-party commodity pricing API that provided current market rates for standard components

The question: "Which components are we buying above current market rate, from which suppliers, and what's the total overspend?"

In their previous workflow, the analyst would:

  1. Export the ERP data to CSV
  2. Pull the database export
  3. Write a Python script to load both, then call the pricing API for each unique component SKU
  4. Merge the three datasets in Pandas
  5. Calculate the overspend per line item
  6. Export to Excel for the category manager

This process took about three hours including debugging and demanded the analyst's full attention. The result was a one-time analysis — if the category manager wanted to slice it differently (by region, by quarter, by supplier tier), the analyst had to go back and re-run the script with different parameters.

In Harbinger Explorer, the analyst uploaded the two CSVs, registered the pricing API, and wrote:

WITH po_with_market AS (
  SELECT
    po.po_id,
    po.supplier_id,
    po.component_sku,
    po.quantity,
    po.unit_price AS negotiated_price,
    mp.market_unit_price,
    po.quantity * po.unit_price AS total_paid,
    po.quantity * mp.market_unit_price AS market_value,
    po.quantity * (po.unit_price - mp.market_unit_price) AS overspend
  FROM purchase_orders po
  JOIN market_prices_api mp ON po.component_sku = mp.sku
  WHERE po.order_date >= CURRENT_DATE - INTERVAL '365 days'
    AND po.unit_price > mp.market_unit_price
),
supplier_summary AS (
  SELECT
    s.supplier_name,
    s.supplier_tier,
    s.category_code,
    COUNT(DISTINCT pwm.component_sku) AS overpriced_components,
    SUM(pwm.total_paid) AS total_paid,
    SUM(pwm.market_value) AS market_equivalent,
    SUM(pwm.overspend) AS total_overspend,
    ROUND(SUM(pwm.overspend) / SUM(pwm.total_paid) * 100, 1) AS overspend_pct
  FROM po_with_market pwm
  JOIN supplier_master s ON pwm.supplier_id = s.supplier_id
  GROUP BY s.supplier_name, s.supplier_tier, s.category_code
)
SELECT *
FROM supplier_summary
WHERE total_overspend > 10000
ORDER BY total_overspend DESC

The query ran in four seconds. Total time from "upload files" to "result in front of the category manager": 25 minutes, including the time to write the SQL. The category manager could then ask follow-up questions — "show me only Tier 1 suppliers" — and the analyst could modify the WHERE clause and re-run in 30 seconds, without rebuilding anything.

The saved query became a recurring report: every month, the analyst runs it against updated exports and the live API, and the category manager has a fresh overspend report. No Python. No Jupyter notebook. One SQL query in a browser.

The procurement team's first analysis identified €340k in annual overspend across three supplier relationships, enough to cover years of Harbinger Explorer subscriptions.

Conclusion

Multi-source data joins are one of the most common analytical tasks and one of the most friction-heavy. The choice between "spend two hours writing Python" and "approximate it with VLOOKUP" is a false dichotomy. Harbinger Explorer gives you a third option: write SQL in your browser, across any combination of files and APIs, and get answers in minutes.

The Python pipeline you would have written to answer one question becomes a saved SQL query that your whole team can run — no environment setup, no notebook sharing, no "it worked on my machine."


Ready to skip the setup and start exploring? Try Harbinger Explorer free →




Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial
