Automated Data Profiling: Know Your Data Before You Trust It

15 min read · Tags: automated data profiling, data quality, data engineering, PII detection, data catalog, column profiling, data governance


You've just pulled a new dataset from an external API. There are 200,000 rows, 40 columns, and a README that says "data is clean and normalized." Your downstream model needs this data by Friday.

Do you trust it?

If you're honest with yourself: no. You know better. You've been burned before. That "normalized" dataset had six different date formats. The "clean" dataset had 30% null values in the primary join key. The "complete" export had rows missing for the last three months because someone changed the API pagination behavior and nobody noticed.

Before you trust any data, you need to profile it. You need to know what's actually there — not what the documentation says is there.

Automated data profiling is the practice of systematically characterizing every column in a dataset: its data type, null rate, cardinality, value distribution, min/max ranges, most common values, and any patterns that suggest quality issues. Done manually, profiling is a tedious multi-hour process. Done automatically, it takes seconds and gives you more information than most manual profiles ever do.


The Real Cost of Skipping Data Profiling

Data quality issues don't usually announce themselves. They hide in columns you didn't inspect, show up in edge cases you didn't test, and surface in production when the cost of finding them is highest.

Pain point 1: You can't know what you don't measure.

Most engineers do spot-check profiling — df.head(), df.describe(), maybe a null count on the obvious columns. But spot-checks miss systematic problems. A column with 0.1% null values looks fine in a sample. In 200,000 rows, that's 200 null values that will cause join failures or silent aggregation errors downstream. Systematic profiling catches what spot-checks miss.
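A systematic null sweep takes only a few lines. Here is a minimal sketch, assuming the export has been loaded into a pandas DataFrame (the file name is a placeholder):

```python
import pandas as pd

# Placeholder load step; point this at your actual export.
df = pd.read_csv("external_export.csv")

# Null rate for every column, not just the ones you thought to check.
null_report = (
    df.isna().mean()       # fraction of nulls per column
      .mul(100).round(2)   # as a percentage
      .sort_values(ascending=False)
      .rename("null_pct")
)

# Anything above your tolerance (here, 0.1%) deserves a closer look.
print(null_report[null_report > 0.1])
```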

Pain point 2: Data types lie.

A column typed as INTEGER might contain strings wherever the original export system hit an edge case — the string "N/A" gets stored as-is in a text field, or a numeric field that's 99.9% numbers has 50 rows where someone entered "pending". Your ETL job reads the column as string to avoid errors, and now you can't do arithmetic on it downstream without cleaning it first. Profiling that checks actual value distributions — not just declared types — catches this before it causes problems.
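A quick way to check for this yourself is to try parsing every value and see what refuses to convert. A sketch, assuming a pandas DataFrame df and a purely illustrative column name:

```python
import pandas as pd

col = "order_total"  # illustrative column name, substitute your own

# Attempt numeric parsing; values that fail become NaN.
parsed = pd.to_numeric(df[col], errors="coerce")

# Values present in the raw column that failed to parse are the
# "N/A" / "pending" style strings hiding in a supposedly numeric field.
non_numeric = df[col][parsed.isna() & df[col].notna()]
print(f"{len(non_numeric)} non-numeric values found")
print(non_numeric.value_counts().head(10))
```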

Pain point 3: Cardinality assumptions break joins and aggregations.

If you're joining on a "unique" identifier column, you need to know it's actually unique. If it's not — if there are 5,000 duplicate IDs in a 200,000 row dataset — your join creates a Cartesian product and your aggregations are wildly wrong. Without profiling, you discover this either in a code review that catches the problem, or in a production metric that's been wrong for two weeks.
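Checking this before writing the join takes seconds. A sketch, with the key column name as a placeholder:

```python
key = "customer_id"  # placeholder for your join key

# Mark every row that shares its key with at least one other row.
dupes = df[key].duplicated(keep=False)
print(f"{int(dupes.sum())} rows have a non-unique {key}")

# The most-duplicated keys are usually the most revealing to inspect.
print(df.loc[dupes, key].value_counts().head(10))
```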

Pain point 4: Distributions reveal domain problems.

You're ingesting sales data. After profiling, you notice that 40% of order amounts are zero. The API documentation doesn't mention zero-amount orders. Are these legitimate? Cancellations? Test data that didn't get filtered? You don't know — but you know you have a question to answer before you load this into production. Without profiling, those zero-amount orders load silently and distort every revenue metric downstream.
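The check that raises this question is trivial once you think to run it. A sketch, with the amount column as a placeholder:

```python
amount = "order_amount"  # placeholder column name

zero_share = (df[amount] == 0).mean()
print(f"{zero_share:.1%} of rows have a zero {amount}")

# Sample a few before deciding whether they belong in revenue metrics.
print(df[df[amount] == 0].head())
```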

Pain point 5: PII appears where you didn't expect it.

A "comments" field that was supposed to contain free text product feedback turns out to contain customer email addresses, phone numbers, and in several thousand rows, full names combined with account numbers. This is a compliance issue. Profiling that includes PII detection catches it before the data goes anywhere sensitive.

The compounding effect of these issues is significant. Data quality problems are far cheaper to fix at ingestion than after they have propagated into reporting. Automated data profiling is the gate that catches them early.


How Teams Currently Approach Data Profiling

The standard toolkit for data profiling combines a few approaches, each with meaningful limitations.

pandas profiling (ydata-profiling):

The most common Python approach. You run ProfileReport(df) and get a detailed HTML report with distributions, correlations, null rates, and more. It's genuinely useful — when you have the data already loaded into a DataFrame, have Python set up, have memory available for the report generation (which can be slow on large datasets), and have time to read through the HTML output.
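For reference, the typical workflow looks roughly like this, assuming the ydata-profiling package is installed and the data is already in a DataFrame:

```python
from ydata_profiling import ProfileReport

# minimal=True skips the most expensive computations on large DataFrames.
profile = ProfileReport(df, title="External API export", minimal=True)
profile.to_file("profile_report.html")
```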

The limitation is that it's a local, manual step. You run it once, read it, and it doesn't integrate into your data pipeline. If the data changes next week and you pull a fresh export, you're not automatically re-profiling. It's a one-time artifact, not a continuous monitoring system.

Great Expectations:

A more sophisticated framework for data quality. You define expectations — "this column should never be null," "values should be between 0 and 100," "this column should have fewer than 100 distinct values" — and run them against your dataset. Great Expectations is powerful for enforcing known quality rules.

The limitation is that it requires you to know what rules to define. That's the profiling step — and Great Expectations doesn't do it for you. You still need to understand the data well enough to write meaningful expectations, which means you still need to profile it first.
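For example, a handful of expectations might look like this. The sketch uses the older pandas-style API; recent Great Expectations releases organize the same idea around contexts, suites, and validators:

```python
import great_expectations as ge

# Wrap an existing pandas DataFrame so expect_* methods are available
# (legacy API; newer releases use a context/validator workflow instead).
gdf = ge.from_pandas(df)

gdf.expect_column_values_to_not_be_null("customer_id")
gdf.expect_column_values_to_be_between("discount_pct", min_value=0, max_value=100)
gdf.expect_column_unique_value_count_to_be_between("status", max_value=100)

# Run all expectations and print a pass/fail summary.
print(gdf.validate())
```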

SQL queries:

Experienced data engineers write ad-hoc SQL to profile data: COUNT(*) for nulls, COUNT(DISTINCT column) for cardinality, MIN/MAX for range, GROUP BY with HAVING COUNT(*) > 1 for duplicates. This works, produces precise answers, and can be scripted.
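Scripted, those queries might look like the following. This sketch uses DuckDB's Python API against a local file; the table and column names are placeholders:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE orders AS SELECT * FROM read_csv_auto('export.csv')")

# Nulls, cardinality, and range for one column in a single pass.
print(con.sql("""
    SELECT
        COUNT(*)                      AS total_rows,
        COUNT(*) - COUNT(customer_id) AS null_customer_ids,
        COUNT(DISTINCT customer_id)   AS distinct_customer_ids,
        MIN(order_amount)             AS min_amount,
        MAX(order_amount)             AS max_amount
    FROM orders
""").df())

# Duplicate keys that would break a join.
print(con.sql("""
    SELECT customer_id, COUNT(*) AS n
    FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) > 1
    ORDER BY n DESC
    LIMIT 10
""").df())
```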

The limitation is time and coverage. Writing thorough profiling SQL for a 40-column table takes an hour. Doing it across ten tables from a new data source takes a full day. And it's expert work — junior analysts may not know which profiling queries to run or how to interpret the results.

Manual review:

Looking at the data directly. Useful for qualitative understanding. Doesn't scale. Misses systematic problems in large datasets.

The common thread: existing approaches are either powerful but labor-intensive, or automated but limited in scope. What's missing is automated profiling that runs without manual setup, covers every column comprehensively, and surfaces results in a queryable format.


Automated Data Profiling That Actually Works

The right approach to data profiling is one that happens automatically, covers every column, and doesn't require you to know what problems to look for in advance.

Harbinger Explorer profiles every column automatically as part of its data ingestion workflow. When you add a data source and the AI Crawler collects the data, profiling runs immediately — before you write a single query.

What gets profiled for every column:

  • Data type — the actual type of values present, not the declared schema type
  • Null rate — percentage of rows where this column is null or empty
  • Cardinality — count of distinct values, indicating whether a column is a key, a category, or a free-text field
  • Value distribution — for numeric columns, the mean, median, standard deviation, min, max, and percentile distribution
  • Most common values — for categorical columns, the top values by frequency with their counts
  • Pattern detection — identification of mixed formats, suspicious outliers, or anomalous value patterns
  • PII signals — detection of patterns consistent with email addresses, phone numbers, names, and other personally identifiable information

This runs automatically. You don't write any configuration. You don't specify which columns to check. Every column gets the full treatment.
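To make that list concrete, here is roughly what a per-column profile captures, sketched in plain pandas. This is a generic illustration of the statistics involved, not how Harbinger Explorer computes them internally:

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: type, null rate, cardinality, top value, range."""
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "null_pct": round(s.isna().mean() * 100, 2),
            "distinct": s.nunique(dropna=True),
            "top_value": s.mode(dropna=True).iloc[0] if s.notna().any() else None,
            "min": s.min() if numeric else None,
            "max": s.max() if numeric else None,
        })
    return pd.DataFrame(rows)

# Assumes the dataset is already loaded as df.
print(profile_columns(df).to_string(index=False))
```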

Results in a queryable format:

Profiling results in Harbinger Explorer aren't just a static report — they're queryable through DuckDB SQL. You can write queries like: "Show me all columns with null rate above 5%." "Which columns have fewer than 10 distinct values?" "List all columns where PII was detected." This lets you programmatically incorporate profiling results into your data governance workflows.
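For example, if the profiling results are exposed as a table (the table and column names below are illustrative, not Harbinger Explorer's exact schema), those questions become ordinary DuckDB queries:

```python
import duckdb

# Assumed setup: profiling results available as a "column_profiles" table
# with one row per profiled column (illustrative schema).
con = duckdb.connect("profiles.db")

# Columns with a null rate above 5%.
print(con.sql("""
    SELECT column_name, null_pct
    FROM column_profiles
    WHERE null_pct > 5
    ORDER BY null_pct DESC
""").df())

# Low-cardinality columns and anything flagged as PII.
print(con.sql("""
    SELECT column_name, distinct_count, pii_flag
    FROM column_profiles
    WHERE distinct_count < 10 OR pii_flag
""").df())
```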

Column Mapping with profiling context:

Harbinger Explorer's Column Mapping feature uses profiling information to help you map fields across sources. If two columns from different APIs have similar names, types, and value distributions, the tool suggests they might represent the same concept — accelerating the schema mapping work that comes after profiling.

Continuous profiling with recrawling:

Data changes. A column that had 0% nulls last month might have 15% nulls today because an upstream system changed its behavior. On the Pro plan, Harbinger Explorer's recrawling feature re-profiles data sources on a schedule — so you know when data quality changes, not just when it was first ingested.


Step-by-Step: Automated Data Profiling with Harbinger Explorer

Step 1: Add your data source.

In Harbinger Explorer, click "Add Source" and configure your data source — an API endpoint, a documentation URL that leads to downloadable data, or a structured data feed. The crawler handles fetching.

Step 2: Run the AI Crawler.

The crawler fetches the data and automatically runs profiling on every column in every table or response structure it finds. This happens in the background — you don't initiate profiling separately.

Step 3: Review the profiling summary.

When the crawl completes, the profiling summary is immediately available. You see a column-by-column overview: data types, null rates, cardinality, and any flags for potential quality issues or PII. The summary is designed for fast review — you can scan it in minutes to get a complete picture of data quality.

Step 4: Query for specific issues.

Use the DuckDB SQL interface to query the profiling results. Ask targeted questions: "Which columns have more than 10% null values?" "Are there any columns with only one distinct value?" "Show me the distribution of the revenue column." Get precise answers in seconds.

Step 5: Investigate flagged columns.

For columns with quality issues, click through to see the underlying data. Sample the problematic rows, understand the pattern, and make an informed decision about how to handle them downstream — before they load into production.

Step 6: Document and share.

Use Harbinger Explorer's sharing features to share profiling results with data consumers, data owners, or stakeholders who need to understand the data before using it. Non-technical users can read the profiling summary without needing to run any queries themselves.


Try it yourself: Start exploring for free. No credit card. 8 demo data sources ready to query.


Advanced: Profiling at Scale and for Compliance

Cross-source data quality comparison:

When you're consolidating data from multiple sources into a unified dataset, profiling helps you understand quality differences before you merge. Source A might have 2% nulls in the customer ID column; Source B might have 18%. That difference matters for how you handle the merge, and you want to know it before writing the join logic, not after.

PII Detection for compliance:

Harbinger Explorer's PII Detection runs as part of every profiling pass. Columns containing email patterns, phone number patterns, name patterns, or other PII signals are flagged automatically. This is not a replacement for a formal data classification process, but it's an extremely effective first pass that catches obvious PII exposure before data moves to downstream systems.
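As a simplified illustration of what pattern-based detection involves (a generic sketch, not Harbinger Explorer's actual detection logic), a first pass over text columns might look like this:

```python
import re
import pandas as pd

# Deliberately simple patterns for illustration; real PII detection uses
# broader rule sets and validation than one regex per identifier type.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def pii_hits(series: pd.Series, sample_size: int = 1000) -> dict:
    """Count how many sampled values match each PII pattern."""
    sample = series.dropna().astype(str).head(sample_size)
    return {
        name: int(sample.str.contains(pattern).sum())
        for name, pattern in PII_PATTERNS.items()
    }

# Assumes the dataset is already loaded as df; scan only text columns.
for col in df.select_dtypes(include="object").columns:
    hits = pii_hits(df[col])
    if any(hits.values()):
        print(col, hits)
```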

For teams subject to GDPR, CCPA, or similar regulations, having systematic PII detection in the data ingestion workflow is both a compliance advantage and a risk reduction measure. The alternative — discovering PII in a column named "comments" after the data has already been replicated to a production warehouse — is the kind of incident that generates regulatory notifications.

Governance integration:

The profiling results stored in Harbinger Explorer serve as a data catalog artifact. You have a timestamped record of what the data looked like when it was ingested — its quality profile, its column structure, its PII flags. This is exactly the kind of documentation that data governance programs require, and it's generated automatically rather than requiring engineers to fill in metadata forms manually.

Profiling as a prerequisite to trust:

The most important use of automated data profiling is cultural: it establishes that data quality assessment is not optional. Before any dataset gets used in production reporting or machine learning, it gets profiled. Before any new data source gets integrated, it gets characterized. The profiling step is the gate.

Harbinger Explorer makes this gate essentially free — it happens automatically, adds no meaningful time to the workflow, and requires no specialized skills. There's no longer a good reason to skip profiling. There's just the choice to know what you're working with before you commit to it.


How It Compares

Profiling Task | Manual Approach | Harbinger Explorer
Time to profile a 40-column dataset | 1–4 hours | Automatic, seconds
Coverage | Depends on analyst thoroughness | Every column, every time
Null rate detection | Manual query per column | Automatic for all columns
Cardinality analysis | Manual query per column | Automatic for all columns
PII detection | Requires separate tooling/manual review | Built-in, runs automatically
Queryable results | Not natively — static report | DuckDB SQL on profiling data
Continuous re-profiling | Requires re-running scripts | Scheduled recrawl (Pro)
Non-technical sharing | Export HTML, manual formatting | Built-in sharing, clean summaries

Pricing: Starter at €8/month (25 chats/day, 10 crawls/month) or Pro at €24/month (200 chats/day, 100 crawls/month, recrawling, priority support). See pricing →

Free 7-day trial, no credit card required. Start free →


FAQ

Does automated profiling replace data quality testing frameworks like Great Expectations?

They serve complementary purposes. Automated profiling in Harbinger Explorer helps you understand what the data looks like — discovering quality issues you didn't know to look for. Frameworks like Great Expectations let you enforce quality rules you've explicitly defined. Use profiling to discover, use rule frameworks to enforce. They work well together.

How does PII detection work? Does it read the actual data values?

PII detection uses pattern matching on actual column values — checking whether values match patterns consistent with email addresses, phone numbers, social security numbers, and similar identifiers. This means the crawler does process row-level data to run pattern detection. All data is processed in your account and is not shared. You can review our privacy policy for details on data handling.

Can I profile data from internal databases, not just external APIs?

Harbinger Explorer is currently focused on external API data sources. For internal database profiling, dedicated data catalog tools like Alation, Atlan, or open-source options like OpenMetadata are designed for that use case.

What if a column has mixed types — some numbers, some strings?

Mixed-type columns are flagged in the profiling output. You'll see the distribution of value types within the column, which tells you both the extent of the problem and which type is dominant. This is one of the most valuable things profiling catches — columns where the schema says one thing and the actual data says another.


Conclusion

Data profiling is not optional. Every dataset you ingest from an external source has quality characteristics you need to understand before you put it to work. The null rates, the cardinality, the distributions, the PII signals — these are facts about your data that determine what you can do with it and what risks you're taking if you don't address them.

Manual profiling is slow, incomplete, and doesn't scale. Harbinger Explorer's automated profiling changes the default: every data source gets profiled automatically, every column gets characterized, and the results are immediately queryable. You walk into every new dataset knowing exactly what you're working with.

Trust your data because you've verified it — not because you hope it's clean.


Ready to know your data before you trust it? Try Harbinger Explorer free →

