
The Data Source Inventory Tool Your Team Actually Needs


You open a Confluence page labeled "Data Sources — Master List" and immediately feel that sinking feeling. It was last updated eight months ago. Half the links are broken. Someone added a note in red saying "CHECK WITH DBA BEFORE USING" next to three entries, but the DBA who wrote that note left the company in October. You close the tab and go ask a colleague.

Sound familiar? This is data source sprawl — and it's costing your team more than you probably realize.


The Real Cost of Scattered Data Sources

For most data teams, the concept of a "data source inventory" exists somewhere between wishful thinking and a perpetually postponed backlog ticket. The sources are there — PostgreSQL databases, S3 buckets, REST APIs, Google Sheets exports, third-party vendor feeds — but they're scattered across a dozen Slack threads, half-finished wiki pages, and the institutional knowledge of whoever has been at the company longest.

The result is predictable waste.

Time lost to discovery. Before any meaningful analysis can begin, analysts often spend 30–60 minutes just locating the right data source, confirming it's still active, and figuring out how to connect to it. For a team of five analysts running three projects simultaneously (roughly an hour of discovery per analyst per project each week), that's potentially 15 hours per week lost before a single query is written.

Duplicated effort. Without a central catalog, engineers build their own mental maps. Two analysts end up building similar pipelines to the same source because neither knew the other was doing it. Shadow datasets multiply. Data quality suffers.

Governance gaps. Which data sources contain PII? Which are subject to GDPR retention policies? Which vendor API has a usage cap that will trigger overage fees? If you can't answer these questions from a single interface, you don't have visibility — you have hope.

Onboarding friction. A new data engineer joins the team. On day one, they're handed a five-year-old architecture diagram and told to "figure out the data layer." Two weeks later, they're still finding new sources they didn't know existed. That's not their fault — it's a documentation problem.

The pain isn't abstract. Every day your team operates without a proper data source inventory, you're paying for it in slow sprint velocity, analyst frustration, and decisions made on incomplete information.


How Teams Usually Try to Solve This (and Why It Falls Short)

Let's be honest about the existing approaches. They're not useless — they just don't scale.

The Wiki/Confluence approach. Someone writes a page. It's reasonably accurate on day one. By month three, sources have been added, removed, or restructured and nobody updated the page. The wiki works as a snapshot but fails as a living catalog because maintaining it requires manual discipline that teams rarely sustain under sprint pressure.

The shared spreadsheet. Similar problem. More accessible, perhaps, but still manual. And spreadsheets can't actually connect to a data source to verify it's alive, check its current schema, or surface what columns it contains.

Custom internal tooling. Some mature data engineering teams build their own metadata catalogs. This is the right instinct, but the execution cost is brutal — months of engineering time to build, an ongoing maintenance burden, and a tool that inevitably gets deprioritized the moment a business-critical pipeline needs attention.

Commercial data catalogs (Alation, Collibra, Atlan). These are excellent products built for enterprise scale. They're also priced for enterprise budgets. For a startup or a mid-market company that needs a practical inventory of 20–50 data sources, paying €1000+/month for catalog software is overkill.

What's missing is something in the middle: fast to set up, accurate (not manually maintained), searchable, and genuinely useful without requiring a three-month implementation project.


What a Good Data Source Inventory Tool Actually Looks Like

Before we get to specific tools, let's define what "good" means in this context. A practical data source inventory tool should do five things:

1. Catalog sources automatically. You shouldn't have to manually describe every column in every table. The tool should be able to connect to a source — or crawl it, in the case of APIs and file sources — and extract schema information on its own.

2. Make everything searchable. The value of a catalog is proportional to how quickly you can find what you need. Full-text search across source names, column names, descriptions, and tags is non-negotiable.

3. Surface metadata alongside data. Knowing that a column named user_email exists is useful. Knowing that it's been flagged as PII, that it contains ~400,000 unique values, and that it's joined in three downstream pipelines is genuinely powerful.

4. Support ad-hoc querying. Inventory is only step one. The moment someone finds the right source, they want to explore it. A tool that catalogs sources but forces you to switch to a different tool to query them creates unnecessary friction.

5. Stay up to date without manual effort. Schemas change. APIs are versioned. Tables get deprecated. An inventory that's only accurate at the time of initial crawl becomes a liability. Automatic recrawling on a schedule is essential.


Introducing Harbinger Explorer: A Searchable Catalog That Queries

Harbinger Explorer is a browser-based data source inventory tool built around the idea that cataloging and querying should happen in the same place.

Here's the core workflow:

You give Harbinger Explorer a URL — whether that's a REST API endpoint, a data portal, a public dataset, or any web-accessible data source. The AI Crawler takes over from there. It explores the source, maps its structure, identifies column types, flags potential PII fields, and builds a schema representation automatically. You don't write documentation; the crawler writes it for you.

The result lands in your personal catalog: a searchable, browsable library of every data source you've added. Each entry shows you the source URL, the extracted schema, column names and types, and any governance flags the crawler identified (PII columns, sensitive field types). Everything is full-text searchable — type "revenue" and find every column across every source in your catalog that might contain revenue data.

And then, without switching tools, you can query directly. Harbinger Explorer uses DuckDB SQL to let you run queries against your cataloged sources from a browser interface. No local database client. No SSH tunnel. No "wait, which connection string is it again?" You found the source, you see the schema, you write the query — in one place.
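To make that concrete, here's the shape of a query you might run in the editor. The source name orders and its columns are hypothetical placeholders (your catalog entries will have their own names), but the SQL itself is standard DuckDB:

```sql
-- Illustrative only: "orders" and its columns are placeholder names
-- standing in for one of your cataloged sources.
SELECT
    customer_id,
    COUNT(*)          AS order_count,
    SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 10;
```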


How to Build Your Data Source Inventory in Harbinger Explorer

Getting started takes minutes, not weeks.

Step 1: Register and log in. Go to harbingerexplorer.com/register. No credit card required. You get a 7-day free trial with access to all features, plus 8 demo data sources already loaded so you can explore before adding your own.

Step 2: Add your first data source. Click "New Source" and paste in a URL. This can be a public API endpoint, a CSV/JSON download link, or any web-accessible data resource. The AI Crawler starts immediately — most sources are processed in under 30 seconds.

Step 3: Review the generated schema. Once crawled, Harbinger Explorer shows you the column structure: names, inferred types, example values (where available), and any PII or governance flags the crawler detected. You can add tags and notes at this point to make the source more discoverable.

Step 4: Search across your catalog. As you add more sources, use the search interface to find what you need. Search by column name, data type, tag, or free text. Your catalog becomes a living index of everything your team has access to.

Step 5: Query without leaving the browser. Click into any source and open the SQL editor. Write DuckDB SQL against your cataloged data — filter, aggregate, join across sources, export results. The catalog and the query interface are unified.
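One handy habit when you first open a source: DuckDB's SUMMARIZE statement profiles every column in a single pass (types, min and max, approximate distinct counts, null percentages). Assuming the editor passes it through like any other statement, and with demo_events as a placeholder source name:

```sql
-- SUMMARIZE is built into DuckDB: one row of profile stats per column.
-- "demo_events" is a placeholder for one of your cataloged sources.
SUMMARIZE SELECT * FROM demo_events;
```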


Try it yourself: start exploring for free. No credit card. 8 demo data sources ready to query.


Advanced Features: Governance, PII, and Team Workflows

For teams managing data at scale, Harbinger Explorer goes beyond basic inventory.

PII Detection. The AI Crawler automatically flags columns that are likely to contain personally identifiable information — names, email addresses, phone numbers, government IDs, and similar. This gives your team an immediate starting point for GDPR compliance reviews without manual annotation. Every new source you crawl comes with a ready-made PII audit.
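Harbinger's detection internals aren't public, but a name-based heuristic gives the flavor of that first pass. The sketch below is standard DuckDB SQL over the information schema, purely illustrative rather than the product's actual implementation:

```sql
-- A rough name-based PII screen, for illustration only.
-- Real detection also looks at data patterns, not just column names.
SELECT table_name, column_name
FROM information_schema.columns
WHERE regexp_matches(lower(column_name),
                     'email|phone|ssn|passport|birth|address|name');
```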

Column Mapping. When you have multiple data sources that contain related information — say, a customer_id column that appears in five different sources — Column Mapping surfaces those relationships automatically. This is particularly valuable when building joins across sources or when tracing data lineage.
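For an intuition of what that mapping involves, here's a simplified sketch in plain DuckDB SQL (illustrative, not the feature's implementation) that finds column names recurring across tables:

```sql
-- Simplified illustration of the column-mapping idea: surface
-- column names that appear in more than one table/source.
SELECT
    column_name,
    list(table_name) AS appears_in,
    COUNT(*)         AS source_count
FROM information_schema.columns
GROUP BY column_name
HAVING COUNT(*) > 1
ORDER BY source_count DESC;
```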

Governance Tagging. Beyond PII, you can tag sources with custom governance labels: sensitive, external-vendor, deprecated, approved-for-analytics, whatever fits your workflow. Tags are searchable and can be applied in bulk after a crawl.

Recrawling (Pro). Data sources don't stand still. APIs get new endpoints. Tables gain new columns. Harbinger Explorer's Pro plan includes scheduled recrawling so your catalog stays current automatically. You define the frequency; Harbinger handles the rest.

DuckDB SQL with cross-source joins. One of the more powerful features for advanced users: because all your sources are cataloged in the same system, you can write SQL queries that join across them. Pull data from a public API source and join it against a CSV you've cataloged, all in the same query window.
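Here's a sketch of what that looks like in practice, where api_users and billing_csv are hypothetical names for two cataloged sources:

```sql
-- Hypothetical cross-source join; source and column names are
-- placeholders for entries in your own catalog.
SELECT
    u.user_id,
    u.signup_date,
    SUM(b.amount) AS total_billed
FROM api_users AS u
JOIN billing_csv AS b
  ON b.user_id = u.user_id
GROUP BY u.user_id, u.signup_date
ORDER BY total_billed DESC;
```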


Comparison: Old Way vs. Harbinger Explorer

| Feature | Spreadsheet/Wiki | Harbinger Explorer |
| --- | --- | --- |
| Initial setup time | Hours to days | Minutes per source |
| Schema accuracy | Manual, often stale | AI-crawled, auto-updated |
| Full-text search | Basic (Ctrl+F) | Built-in, column-level |
| PII detection | Manual review | Automatic flagging |
| Ad-hoc querying | Separate tool required | Inline DuckDB SQL |
| Recrawling | Manual re-entry | Scheduled (Pro) |
| Onboarding new team members | Tour the spreadsheet | Share catalog URL |
| Cost (team of 5) | Staff time only | From €8/month per user |

Pricing: Starter at €8/month (25 chats/day, 10 crawls/month) or Pro at €24/month (200 chats/day, 100 crawls/month, recrawling, priority support). See pricing →

Free 7-day trial, no credit card required. Start free →


The Hidden Value of Consolidated Schema Knowledge

There's a compounding benefit to maintaining a proper data source inventory that's easy to underestimate at first: it makes every future query faster.

When your catalog is comprehensive and current, the pattern of work changes. An analyst no longer starts a project by asking "what data do we have that might be relevant?" They start by searching the catalog. The answer is there already — column names, data types, example values, governance flags. The analyst goes from question to first query in minutes rather than hours.

This matters disproportionately for certain types of work:

Incident response. When something breaks in a pipeline and you need to trace data from source to output fast, a searchable catalog is the difference between a 20-minute investigation and a 2-hour one. Every source is documented. Every column is named. You're not asking around; you're looking it up.

Cross-functional collaboration. When a product manager asks "do we have data on X?", the traditional answer is "let me check with the data team, it'll take a day or two." With a proper catalog, the answer is "search for X in the catalog." Non-technical stakeholders can self-serve, which changes the relationship between data teams and the rest of the organization.

Compliance and audit readiness. When a GDPR access request comes in and legal needs to know everywhere a user's data might live, a cataloged inventory with PII flags means you can produce a comprehensive answer quickly. Without it, you're doing a manual sweep of every system — stressful, slow, and error-prone.

New project scoping. Starting a new analytical project often involves a phase of "what do we even have to work with?" A complete catalog turns this from a discovery phase that takes a week into a 30-minute browsing session.

None of these benefits are hypothetical. They're the natural result of one thing: knowing what data you have and where to find it.


FAQ: Data Source Inventory with Harbinger Explorer

Can Harbinger Explorer connect to private or internal databases? Currently, Harbinger Explorer is optimized for web-accessible data sources — public APIs, hosted datasets, CSV/JSON endpoints accessible via URL. Support for private database connections (PostgreSQL, MySQL, etc.) is on the roadmap. For internal sources today, you can use exported schema files or sample data exports as a starting point.

How does the AI Crawler handle authentication? For sources that require API keys or token-based authentication, you can configure authentication headers when adding a source. Harbinger Explorer stores these securely and uses them during crawling and querying sessions.

How accurate is the PII detection? The PII detection is heuristic-based, drawing on column names, data patterns, and field structures. It's designed to surface likely PII candidates for human review — not to replace a legal compliance process. Think of it as an automated first pass that saves your team from manual annotation.

Is my data stored by Harbinger Explorer? Harbinger Explorer stores schema metadata and query results from your sessions, but it does not ingest or persistently store the underlying source data. Queries run against live sources in your session context.

What if I already use a commercial data catalog? Harbinger Explorer is complementary to enterprise catalog tools for teams that use them. Many users find it useful as a lightweight exploration layer — quickly discovering and querying sources before deciding which ones warrant formal cataloging in a heavier system.

How does the Starter plan differ from Pro for catalog use? The Starter plan at €8/month includes 10 crawls per month, which is sufficient for a small catalog of up to 10 active sources (one initial crawl each). Pro at €24/month provides 100 crawls plus scheduled recrawling — essential for catalogs where sources update frequently and you need the schema to stay current automatically.


Conclusion: Stop Asking Around, Start Searching

The data source inventory problem is one of those quiet productivity drains that rarely gets prioritized until it's genuinely painful. By then, you've already lost hundreds of analyst-hours to discovery overhead, built half a dozen redundant pipelines, and onboarded three engineers who are still finding sources by asking colleagues.

Harbinger Explorer is designed to make this easier from day one. Add sources, let the AI Crawler build your catalog automatically, search across everything you've added, and query directly in the browser. It's not a multi-year implementation project — it's a tool you can be productive in 15 minutes after signing up.

Your team's data knowledge shouldn't live in someone's head or in a Confluence page from 2022. Put it somewhere searchable. A catalog that's accurate, current, and queryable is one of the highest-leverage investments a data team can make — and with Harbinger Explorer, the setup cost is measured in minutes, not months.


Ready to skip the setup and start exploring? Try Harbinger Explorer free →


