Automated Data Profiling Without Python: A Practical Guide

Tags: data profiling, data quality, no-code, automation, analysts


Before you trust any dataset, you need to understand it. What fields does it have? Which ones are null half the time? Are there obvious outliers? Do date ranges actually match what the documentation claims? Is the data type for "country code" actually a string, or did someone store it as an integer somewhere?

This process — data profiling — is the unglamorous but essential first step of every serious analysis. And yet, most data professionals either skip it (and pay the price later) or spend hours setting it up in Python before they can even start the real work.

There's a better way.


What Data Profiling Actually Is

Data profiling is the process of examining a dataset to summarize its structure, quality, and content. At minimum, a useful data profile tells you:

  • Schema: What fields exist, and what are their types?
  • Completeness: For each field, what percentage of values are non-null?
  • Distributions: For numeric fields, what are the min, max, mean, and percentiles?
  • Cardinality: For categorical fields, how many unique values exist? What are the most common ones?
  • Temporal coverage: For date fields, what's the actual date range in the data (vs. what the docs claim)?
  • Anomalies: Are there values that look like data entry errors, encoding issues, or outliers?

Without this information, you're building analysis on a foundation you can't see. You don't know if your "GDP" column is 100% populated or 40% null. You don't know if that "country" field has 195 clean values or 230 messy variations including "USA", "US", "United States", and "U.S.A." You won't find out until your analysis produces unexpected results, at which point you have to go back and fix the foundation.


The Traditional Profiling Workflow

The standard approach to data profiling in 2024 is Python-based, using libraries like pandas-profiling (now ydata-profiling), great_expectations, or just raw pandas.

Here's what that looks like in practice:

Step 1: Get the data into Python. This means writing an ingestion script for your API of choice, handling authentication, pagination, error handling, and rate limits. Even for a clean API with good documentation, this takes 1–3 hours.

Step 2: Set up your environment. Install the right libraries, manage virtual environments, deal with dependency conflicts. If you're on a new machine or a client's machine, add another 30–60 minutes.

Step 3: Write or run the profiling code. pandas-profiling generates a nice report with one function call — but you still have to write the code to load the data and call the function. great_expectations gives you more control but requires writing expectation definitions. Either way, you're writing code.
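To make Step 3 concrete, here is roughly what "one function call" looks like once the data is loaded, sketched with raw pandas (the column names and values are invented for illustration; with ydata-profiling the equivalent would be a `ProfileReport(df)` call, but you still write the loading code yourself):

```python
# A minimal "traditional" profiling pass with raw pandas.
import pandas as pd

# Hypothetical dataset with the kinds of problems profiling catches.
df = pd.DataFrame({
    "gdp": [1.2, None, 3.4, None, 5.6],
    "country": ["US", "USA", "United States", "DE", "DE"],
})

# Null rate per column, as a percentage.
null_pct = df.isna().mean().mul(100).round(1)
print(null_pct["gdp"])          # 40.0 — two of five GDP values missing
print(df["country"].nunique())  # 4 distinct spellings covering 2 countries
```

Even this toy version assumes you already solved Steps 1 and 2: the data is in memory and pandas is installed and working.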

Step 4: Interpret the output. pandas-profiling generates a beautiful HTML report, but it's dense. Navigating it, understanding what's important, and identifying the issues that actually matter for your use case takes time.

Total time from "I want to understand this dataset" to "I understand this dataset": 3–6 hours minimum.

For a data engineer who does this daily, the setup time amortizes over hundreds of projects. For a freelance analyst starting a new project, a bootcamp grad who learned pandas last month, or an internal analyst at a company that doesn't have a data engineering team — this is a significant barrier.


The Alternative: Profiling as a Built-In Feature

What if profiling wasn't something you had to set up, but something that happened automatically every time you pulled data?

That's the design philosophy behind Harbinger Explorer's data profiling. When you pull data from any source in Harbinger's catalog, the platform automatically runs a profiling pass and surfaces the results in the UI. No setup, no code, no installation.

Here's what you see automatically:

Schema Summary

Every field in the dataset, its inferred type (string, integer, float, date, boolean), and whether Harbinger has detected any type inconsistencies (e.g., a field that is an integer in most records but arrives as a string in others).
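The type-inconsistency check can be sketched in a few lines of standard-library Python (the records and field names here are invented for illustration, not Harbinger's actual implementation): a field is flagged when its non-null values span more than one type.

```python
# Flag fields whose non-null values mix types (e.g., int and str).
records = [
    {"country_code": "US", "population": 331_000_000},
    {"country_code": "DE", "population": "83000000"},  # stored as a string
    {"country_code": "FR", "population": None},        # nulls are ignored
]

def inconsistent_fields(rows):
    types = {}
    for row in rows:
        for field, value in row.items():
            if value is not None:
                types.setdefault(field, set()).add(type(value).__name__)
    # Keep only fields that mix more than one non-null type.
    return {f: sorted(t) for f, t in types.items() if len(t) > 1}

print(inconsistent_fields(records))  # {'population': ['int', 'str']}
```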

Completeness Report

For each field: total records, non-null count, null count, and null percentage. Sorted by null rate so you immediately see which fields are the most problematic. Color-coded so you can spot a field that's 60% null in half a second.
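The report described above amounts to a simple per-field tally, sortable by null rate. A standard-library sketch, with made-up records (this is illustrative, not Harbinger's code):

```python
# Completeness report: non-null count, null count, and null percentage
# per field, sorted so the most problematic field comes first.
records = [
    {"gdp": 1.2, "inflation": 2.0},
    {"gdp": None, "inflation": 2.1},
    {"gdp": None, "inflation": None},
    {"gdp": 3.3, "inflation": 2.4},
]

total = len(records)
report = []
for field in records[0]:
    nulls = sum(1 for r in records if r[field] is None)
    report.append((field, total - nulls, nulls, round(100 * nulls / total, 1)))

report.sort(key=lambda row: row[3], reverse=True)  # worst null rate first
for field, non_null, nulls, pct in report:
    print(f"{field}: {non_null} non-null, {nulls} null ({pct}%)")
```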

Distribution Summary

For numeric fields: min, max, mean, median, standard deviation, and a sparkline histogram. For categorical fields: cardinality (unique value count) and the top 5 most common values with their frequencies.
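These summary statistics are all computable with Python's standard library, which is a useful mental model for what the profile is showing you (the numbers and country values below are invented):

```python
# Distribution summary for one numeric and one categorical field.
import statistics
from collections import Counter

gdp = [1.1, 2.4, 2.4, 3.0, 9.9]
print(min(gdp), max(gdp))              # 1.1 9.9
print(round(statistics.mean(gdp), 2))  # 3.76
print(statistics.median(gdp))          # 2.4
print(round(statistics.stdev(gdp), 2)) # 3.5 — the 9.9 outlier inflates this

countries = ["US", "DE", "US", "FR", "US", "DE"]
print(len(set(countries)))             # cardinality: 3
print(Counter(countries).most_common(2))  # [('US', 3), ('DE', 2)]
```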

Temporal Coverage

For date/timestamp fields: earliest and latest values, plus any detected gaps in the time series. If the API documentation says "data goes back to 2000" but the earliest record in the data is 2005, Harbinger flags this.
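Gap detection in a monthly series boils down to comparing the month span implied by the earliest and latest dates against the months actually present. A standard-library sketch with invented dates:

```python
# Temporal coverage: find months missing between the earliest and latest dates.
from datetime import date

dates = [date(2024, 1, 15), date(2024, 2, 1), date(2024, 4, 20), date(2024, 5, 3)]

months = {(d.year, d.month) for d in dates}
earliest, latest = min(dates), max(dates)
expected = (latest.year - earliest.year) * 12 + (latest.month - earliest.month) + 1

print(earliest, latest)       # 2024-01-15 2024-05-03
print(expected, len(months))  # 5 expected months, only 4 present: a gap
missing = [(2024, m) for m in range(1, 6) if (2024, m) not in months]
print(missing)                # [(2024, 3)]
```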

Anomaly Highlights

Harbinger's AI layer scans for common data quality issues: impossible values (negative population counts, future dates in historical data), suspicious outliers, encoding artifacts (strange Unicode characters in text fields), and duplicate records.
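The simplest of these checks are plain rule-based scans, sketched below with invented rows (Harbinger's actual detection logic is not public; this only illustrates the categories named above):

```python
# Rule-based anomaly scan: impossible values, future dates in historical
# data, and exact duplicate records.
from datetime import date

rows = [
    {"country": "US", "population": 331_000_000, "as_of": date(2020, 7, 1)},
    {"country": "XX", "population": -5,          "as_of": date(2020, 7, 1)},  # impossible
    {"country": "DE", "population": 83_000_000,  "as_of": date(2099, 1, 1)},  # future
    {"country": "US", "population": 331_000_000, "as_of": date(2020, 7, 1)},  # duplicate
]

impossible = [r for r in rows if r["population"] < 0]
future = [r for r in rows if r["as_of"] > date.today()]

seen, duplicates = set(), []
for r in rows:
    key = (r["country"], r["population"], r["as_of"])
    if key in seen:
        duplicates.append(r)
    else:
        seen.add(key)

print(len(impossible), len(future), len(duplicates))  # 1 1 1
```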

All of this appears within seconds of pulling the data, with zero configuration required.


Before/After: A Real Profiling Scenario

Scenario: You're an analyst at a policy research institute. You've been asked to build a model predicting economic growth in emerging markets. You've identified 4 potential data sources for your input features. You need to understand what's actually in each one before committing to any.

Before Harbinger (traditional workflow):

Day 1 morning: Set up Python environment, write ingestion scripts for all 4 APIs. Hit a rate limit on API 2. Debug authentication issues with API 3. Get basic ingestion working by end of day: 6 hours.

Day 1 afternoon: Run pandas-profiling on each dataset. API 4 has a weird encoding issue that crashes the profiler. Debug it. Re-run. Read through 4 HTML reports, each 50+ pages. Take notes on the key issues: 3 hours.

Summary of findings: API 1 has a 35% null rate on the key field you need. API 2 is fine but only covers 25 countries. API 3 has country codes in a non-standard format. API 4 is the best option.

Total: 9 hours to reach a dataset selection decision.

After Harbinger:

Morning: Open Harbinger, search for your 4 APIs in the catalog. Pull sample data for each using natural language queries. View the automatic profiling results side by side.

15 minutes in: You can immediately see that API 1 has a 35% null rate on the key field. API 2 covers only 25 countries. API 3 has a known country code formatting issue (documented in the catalog). API 4 is clean and has the broadest coverage.

Decision: API 4. Total time: 20 minutes.

The time savings aren't marginal. They're transformative — the difference between a half-day task and a coffee break.


Profiling in Harbinger's SQL Layer

For more advanced profiling needs, Harbinger's built-in DuckDB engine lets you run custom profiling queries directly in the browser. This is useful when you need to go beyond the automatic profile — for example, profiling conditional distributions, checking correlations between fields, or validating specific business rules.

Some useful patterns:

Check completeness across all fields at once:

SELECT 
  COUNT(*) as total_rows,
  COUNT(gdp) as gdp_not_null,
  COUNT(inflation) as inflation_not_null,
  COUNT(unemployment) as unemployment_not_null,
  ROUND(100.0 * COUNT(gdp) / COUNT(*), 1) as gdp_pct,
  ROUND(100.0 * COUNT(inflation) / COUNT(*), 1) as inflation_pct,
  ROUND(100.0 * COUNT(unemployment) / COUNT(*), 1) as unemployment_pct
FROM economic_data

Find unexpected values in categorical fields:

SELECT country_code, COUNT(*) as n
FROM economic_data
GROUP BY country_code
ORDER BY country_code
-- Look for: typos, mixed formats (USA vs US vs United States), blanks

Profile temporal coverage:

SELECT 
  MIN(date) as earliest,
  MAX(date) as latest,
  COUNT(DISTINCT date) as distinct_dates,
  DATEDIFF('month', MIN(date), MAX(date)) + 1 as expected_months,
  COUNT(DISTINCT DATE_TRUNC('month', date)) as actual_months
FROM economic_data
-- expected_months > actual_months = gaps in the time series

The combination of automatic profiling (for the 80% use case) and SQL-based profiling (for the 20% custom case) covers virtually all profiling needs without requiring you to leave the browser.


Data Profiling as a Habit, Not a Project

The reason data profiling is underutilized isn't that analysts don't understand its value. It's that when profiling requires spinning up a Python environment and writing code, it feels like a separate project — something you do formally at the start of a big engagement, not something you do every time you pull a new dataset.

When profiling is automatic and instant, it becomes a habit. You pull data, you see the profile, you immediately know whether the data is trustworthy for your purpose. It takes no extra time; it's just there.

This behavior change matters at scale. An analyst who profiles data automatically will catch data quality issues at the source — before they contaminate an analysis. An analyst who only profiles formally, at the start of major projects, will miss issues in ad-hoc pulls, quick queries, and mid-project source changes.

Harbinger is designed to make the good behavior frictionless. When the right thing to do is also the easiest thing to do, people do it.


Who Benefits Most From Automated Profiling

Freelancers: When you're starting a new client project with unfamiliar data sources, automated profiling gives you instant credibility. You can quickly identify data quality issues and present them to the client as findings, not surprises.

Bootcamp grads and junior analysts: Profiling is one of the first skills that gets taught in data courses, but the tooling is often complex. Harbinger makes it accessible from day one, without the Python setup overhead.

Research teams: Academic and policy researchers often work with public datasets that are poorly documented and inconsistently maintained. Automated profiling surfaces issues that the documentation doesn't mention.

Team leads: When you need to evaluate whether a new data source is suitable for a production pipeline, automated profiling gives you a quick quality signal without committing engineering time.


The Bottom Line

Data profiling is not optional — it's the foundation that determines whether your analysis is trustworthy or not. The question is whether it's a half-day project or a 20-minute sanity check.

With Harbinger Explorer, every data pull comes with an automatic profile. You see completeness, distributions, anomalies, and temporal coverage instantly, without writing code, without setting up libraries, without leaving the browser. And for the cases where you need to go deeper, the built-in DuckDB SQL engine gives you full profiling power without any local setup.

Stop building data quality blind. Profile everything, automatically, from the moment you touch a new dataset.


Ready to profile your data without a Python environment?

Try Harbinger Explorer free for 7 days — no credit card required. Starter plan from €8/month.


