How Data Contracts Prevent Silent Degradation in Production Systems

Data quality issues are the silent killers of production systems. A single malformed record can crash your pipeline. A gradual drift in data distributions can slowly degrade model performance. Missing values that sneak through validation can corrupt downstream analytics. The cost of poor data quality is measured not just in failed jobs, but in wrong business decisions, customer frustration, and lost revenue.
Data validation and cleaning are not optional preprocessing steps. They are your first line of defense against data degradation. This article explores practical techniques for ensuring data quality through validation rules, type enforcement, and systematic cleaning operations. We will look at how to catch issues early, handle them gracefully, and build data contracts that prevent silent failures.
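Before diving into individual cleaning techniques, it helps to see what a data contract looks like in its simplest form: an explicit schema checked at the pipeline boundary. The sketch below is a minimal, hypothetical example (the `CONTRACT` dictionary and `enforce_contract` helper are illustrative names, not a standard API); it reports violations instead of letting bad data pass silently.

```python
import pandas as pd

# A minimal, hypothetical data contract: expected column names and dtypes
CONTRACT = {
    "customer_id": "int64",
    "amount": "float64",
    "status": "object",
}

def enforce_contract(df: pd.DataFrame, contract: dict) -> list:
    """Return a list of violations instead of failing silently."""
    violations = []
    for column, expected_dtype in contract.items():
        if column not in df.columns:
            violations.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            violations.append(
                f"{column}: expected {expected_dtype}, got {df[column].dtype}"
            )
    return violations

df = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": ["10.5", "20.0"],  # wrong type: strings, not floats
    "status": ["ok", "ok"],
})
print(enforce_contract(df, CONTRACT))
```

Every technique in this article is, in effect, a way of bringing data into compliance with a contract like this one.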
Remove Duplicates
Duplicate records inflate dataset sizes, skew statistics, and create incorrect aggregations. Removing duplicates is often the first step in data cleaning, but doing it correctly requires understanding which columns define uniqueness and which duplicate to keep.
import pandas as pd
import numpy as np
# Create sample data with duplicates
data = {
    'customer_id': [101, 102, 103, 101, 104, 102],
    'transaction_date': ['2024-01-15', '2024-01-16', '2024-01-17',
                         '2024-01-15', '2024-01-18', '2024-01-19'],
    'amount': [250.0, 150.0, 300.0, 250.0, 400.0, 150.0],
    'status': ['completed', 'completed', 'pending',
               'completed', 'completed', 'failed']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print(f"\nShape: {df.shape}")
# Remove complete duplicates (all columns match)
df_no_dup = df.drop_duplicates()
print("\n\nAfter removing complete duplicates:")
print(df_no_dup)
print(f"Shape: {df_no_dup.shape}")
# Remove duplicates based on specific column (customer_id)
df_unique_customers = df.drop_duplicates(subset=['customer_id'], keep='first')
print("\n\nKeeping first occurrence per customer:")
print(df_unique_customers)
# Keep last occurrence instead
df_last = df.drop_duplicates(subset=['customer_id'], keep='last')
print("\n\nKeeping last occurrence per customer:")
print(df_last)
Output:
Original data:
customer_id transaction_date amount status
0 101 2024-01-15 250.0 completed
1 102 2024-01-16 150.0 completed
2 103 2024-01-17 300.0 pending
3 101 2024-01-15 250.0 completed
4 104 2024-01-18 400.0 completed
5 102 2024-01-19 150.0 failed
Shape: (6, 4)
After removing complete duplicates:
customer_id transaction_date amount status
0 101 2024-01-15 250.0 completed
1 102 2024-01-16 150.0 completed
2 103 2024-01-17 300.0 pending
4 104 2024-01-18 400.0 completed
5 102 2024-01-19 150.0 failed
Shape: (5, 4)
Keeping first occurrence per customer:
customer_id transaction_date amount status
0 101 2024-01-15 250.0 completed
1 102 2024-01-16 150.0 completed
2 103 2024-01-17 300.0 pending
4 104 2024-01-18 400.0 completed
Keeping last occurrence per customer:
customer_id transaction_date amount status
2 103 2024-01-17 300.0 pending
3 101 2024-01-15 250.0 completed
4 104 2024-01-18 400.0 completed
5 102 2024-01-19 150.0 failed
The subset parameter defines uniqueness criteria. The keep parameter controls which duplicate to retain: 'first', 'last', or False to remove all duplicates.
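When any duplication signals an upstream bug, keep=False is the stricter choice: it drops every copy rather than keeping one. A small sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 101],
    "amount": [250.0, 150.0, 300.0, 250.0],
})

# keep=False removes every row involved in a duplicate, not just the extras
df_strict = df.drop_duplicates(subset=["customer_id"], keep=False)
print(df_strict)
```

Customer 101 appears twice, so both of its rows are removed; only 102 and 103 survive.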
Check Data Types
Type mismatches cause runtime errors and incorrect calculations. Before performing any analysis, verify that each column contains the expected data type. This prevents treating numeric strings as numbers or attempting mathematical operations on text fields.
import pandas as pd
# Create data with mixed types
data = {
    'user_id': [1, 2, 3, 4, 5],
    'age': ['25', '30', 'unknown', '45', '28'],
    'salary': [50000, 65000, 70000, 80000, 55000],
    'join_date': ['2020-01-15', '2019-03-22', '2021-07-10',
                  '2018-11-05', '2022-02-18'],
    'active': [True, False, True, True, False]
}
df = pd.DataFrame(data)
# Check current data types
print("Current data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
# Detailed type information
print("Detailed information:")
print(df.info())
print("\n" + "="*50 + "\n")
# Check for specific type
print("Checking if 'salary' is numeric:")
print(pd.api.types.is_numeric_dtype(df['salary']))
print("\nChecking if 'age' is numeric:")
print(pd.api.types.is_numeric_dtype(df['age']))
# Identify columns with object type (potential mixed types)
object_columns = df.select_dtypes(include=['object']).columns
print(f"\nColumns with object type: {list(object_columns)}")
Output:
Current data types:
user_id int64
age object
salary int64
join_date object
active bool
dtype: object
==================================================
Detailed information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 5 non-null int64
1 age 5 non-null object
2 salary 5 non-null int64
3 join_date 5 non-null object
4 active 5 non-null bool
dtypes: bool(1), int64(2), object(2)
memory usage: 297.0+ bytes
None
==================================================
Checking if 'salary' is numeric:
True
Checking if 'age' is numeric:
False
Columns with object type: Index(['age', 'join_date'], dtype='object')
The dtypes attribute reveals the current type of each column. Object type often indicates string data or mixed types that need conversion.
Convert Data Types
Once you identify type mismatches, convert columns to their correct types. Type conversion must handle invalid values gracefully to avoid crashing on malformed data.
import pandas as pd
import numpy as np
# Create sample data
data = {
    'product_id': ['1001', '1002', '1003', '1004'],
    'price': ['29.99', '45.50', '19.99', '99.99'],
    'quantity': ['10', '25', '30', '15'],
    'rating': ['4.5', '3.8', '4.9', '4.2'],
    'in_stock': ['true', 'false', 'true', 'true']
}
df = pd.DataFrame(data)
print("Before conversion:")
print(df.dtypes)
print("\n")
print(df)
print("\n" + "="*50 + "\n")
# Convert to appropriate types
df['product_id'] = df['product_id'].astype('int64')
df['price'] = df['price'].astype('float64')
df['quantity'] = df['quantity'].astype('int32')
df['rating'] = df['rating'].astype('float32')
df['in_stock'] = df['in_stock'].map({'true': True, 'false': False})
print("After conversion:")
print(df.dtypes)
print("\n")
print(df)
Output:
Before conversion:
product_id object
price object
quantity object
rating object
in_stock object
dtype: object
product_id price quantity rating in_stock
0 1001 29.99 10 4.5 true
1 1002 45.50 25 3.8 false
2 1003 19.99 30 4.9 true
3 1004 99.99 15 4.2 true
==================================================
After conversion:
product_id int64
price float64
quantity int32
rating float32
in_stock bool
dtype: object
product_id price quantity rating in_stock
0 1001 29.99 10 4.5 True
1 1002 45.50 25 3.8 False
2 1003 19.99 30 4.9 True
3 1004 99.99 15 4.2 True
The astype method converts column types explicitly. For boolean conversions, map provides better control over string to boolean mapping than astype.
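Two caveats worth knowing before relying on astype: it raises a ValueError as soon as it meets an unparseable value, and plain int64 cannot hold missing values. The nullable 'Int64' dtype (capital I) covers the second case. A brief sketch:

```python
import pandas as pd

s = pd.Series(["10", "20", "abc"])

# astype fails fast on the first unparseable value
try:
    s.astype("int64")
except ValueError as exc:
    print(f"astype raised: {exc}")

# Nullable Int64 keeps integer semantics even with missing entries
clean = pd.Series([1, 2, None], dtype="Int64")
print(clean.dtype)
```

Fail-fast behavior is a feature when the column is contractually required to be clean; use pd.to_numeric with errors='coerce' (covered next) when you expect dirty values.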
Handle Mixed Types
Real-world data often contains mixed types in a single column. A numeric column might contain error codes as strings, or a date column might have 'N/A' entries. The to_numeric function handles these scenarios with an errors parameter.
import pandas as pd
import numpy as np
# Create data with mixed types
data = {
    'sensor_id': [1, 2, 3, 4, 5],
    'temperature': ['72.5', '68.3', 'ERROR', '75.1', '70.0'],
    'pressure': ['1013', '1015', '1012', 'null', '1014'],
    'humidity': ['45', '52', '48', '50', 'N/A']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")
# Attempt conversion with errors='coerce' (invalid values become NaN)
df['temperature'] = pd.to_numeric(df['temperature'], errors='coerce')
df['pressure'] = pd.to_numeric(df['pressure'], errors='coerce')
df['humidity'] = pd.to_numeric(df['humidity'], errors='coerce')
print("After conversion (errors='coerce'):")
print(df)
print("\n")
print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
# Check for conversion failures
print("Rows with NaN values:")
print(df[df.isna().any(axis=1)])
Output:
Original data:
sensor_id temperature pressure humidity
0 1 72.5 1013 45
1 2 68.3 1015 52
2 3 ERROR 1012 48
3 4 75.1 null 50
4 5 70.0 1014 N/A
==================================================
After conversion (errors='coerce'):
sensor_id temperature pressure humidity
0 1 72.5 1013.0 45.0
1 2 68.3 1015.0 52.0
2 3 NaN 1012.0 48.0
3 4 75.1 NaN 50.0
4 5 70.0 1014.0 NaN
Data types:
sensor_id int64
temperature float64
pressure float64
humidity float64
dtype: object
==================================================
Rows with NaN values:
sensor_id temperature pressure humidity
2 3 NaN 1012.0 48.0
3 4 75.1 NaN 50.0
4 5 70.0 1014.0 NaN
The errors='coerce' parameter converts invalid values to NaN instead of raising an exception. This allows the pipeline to continue while flagging problematic data for review.
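The default mode, errors='raise', is the opposite contract: stop the pipeline the moment a bad value appears. Which mode to use depends on whether downstream consumers prefer gaps or hard failures. A sketch contrasting the two:

```python
import pandas as pd

readings = pd.Series(["72.5", "ERROR", "70.0"])

# Default behaviour: stop on the first bad value
try:
    pd.to_numeric(readings)  # errors='raise' is the default
except ValueError as exc:
    print(f"raised: {exc}")

# Coerce mode: keep going, but count what was lost
coerced = pd.to_numeric(readings, errors="coerce")
print(f"invalid values: {coerced.isna().sum()}")
```

A common middle ground is to coerce, count the resulting NaNs, and fail the job only if the invalid fraction exceeds a threshold.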
Trim Whitespace
Whitespace in string data causes comparison failures and lookup errors. A user ID with trailing spaces will not match the same ID without spaces, breaking joins and filters. Always strip whitespace from string columns.
import pandas as pd
# Create data with whitespace issues
data = {
    'username': [' john_doe', 'jane_smith ', ' bob_jones ', 'alice_wong'],
    'email': ['john@email.com ', ' jane@email.com',
              'bob@email.com', ' alice@email.com '],
    'city': ['New York ', ' Los Angeles', ' Chicago ', 'Houston']
}
df = pd.DataFrame(data)
print("Before trimming:")
print(df)
print("\n")
print("String lengths:")
print(df.applymap(len))
print("\n" + "="*50 + "\n")
# Trim whitespace from all string columns
df['username'] = df['username'].str.strip()
df['email'] = df['email'].str.strip()
df['city'] = df['city'].str.strip()
print("After trimming:")
print(df)
print("\n")
print("String lengths:")
print(df.applymap(len))
print("\n" + "="*50 + "\n")
# Demonstrate the impact on matching
lookup_username = 'john_doe'
print(f"Looking for username: '{lookup_username}'")
print("Before trim - found:", ' john_doe' == lookup_username)
print("After trim - found:", 'john_doe' == lookup_username)
Output:
Before trimming:
username email city
0 john_doe john@email.com New York
1 jane_smith jane@email.com Los Angeles
2 bob_jones bob@email.com Chicago
3 alice_wong alice@email.com Houston
String lengths:
username email city
0 10 16 11
1 12 16 13
2 12 14 10
3 10 18 7
==================================================
After trimming:
username email city
0 john_doe john@email.com New York
1 jane_smith jane@email.com Los Angeles
2 bob_jones bob@email.com Chicago
3 alice_wong alice@email.com Houston
String lengths:
username email city
0 8 14 8
1 10 14 11
2 9 13 7
3 10 16 7
==================================================
Looking for username: 'john_doe'
Before trim - found: False
After trim - found: True
The str.strip() method removes leading and trailing whitespace. For complex cases, use str.replace() with regex patterns to remove internal whitespace or specific characters.
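For internal whitespace, which str.strip() cannot reach, a regex replacement collapses any run of spaces or tabs to a single space. A small sketch on hypothetical values:

```python
import pandas as pd

names = pd.Series(["  John   Doe ", "Jane\tSmith", "Bob  Jones"])

# Collapse any run of whitespace to one space, then trim the ends
cleaned = names.str.replace(r"\s+", " ", regex=True).str.strip()
print(cleaned.tolist())
```

The \s class also matches tabs and newlines, so this one pattern handles most whitespace irregularities at once.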
Remove Non-Numeric Characters
Numeric columns imported from external sources often contain formatting characters like currency symbols, commas, or percentage signs. These must be stripped before type conversion.
import pandas as pd
# Create data with non-numeric characters
data = {
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'price': ['$1,250.00', '$899.99', '$3,450.50', '$725.00'],
    'discount': ['15%', '20%', '10%', '25%'],
    'quantity': ['1,000', '2,500', '750', '1,250']
}
df = pd.DataFrame(data)
print("Before cleaning:")
print(df)
print("\n")
print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
# Remove currency symbols and commas from price
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].astype('float64')
# Remove percentage signs and convert
df['discount'] = df['discount'].str.replace('%', '', regex=False)
df['discount'] = df['discount'].astype('float64')
# Remove commas from quantity
df['quantity'] = df['quantity'].str.replace(',', '', regex=False)
df['quantity'] = df['quantity'].astype('int64')
print("After cleaning:")
print(df)
print("\n")
print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
# Perform calculations with cleaned data
df['discount_amount'] = df['price'] * (df['discount'] / 100)
df['final_price'] = df['price'] - df['discount_amount']
print("Calculations with cleaned data:")
print(df[['product', 'price', 'discount', 'final_price']])
Output:
Before cleaning:
product price discount quantity
0 Widget A $1,250.00 15% 1,000
1 Widget B $899.99 20% 2,500
2 Widget C $3,450.50 10% 750
3 Widget D $725.00 25% 1,250
Data types:
product object
price object
discount object
quantity object
dtype: object
==================================================
After cleaning:
product price discount quantity
0 Widget A 1250.00 15.0 1000
1 Widget B 899.99 20.0 2500
2 Widget C 3450.50 10.0 750
3 Widget D 725.00 25.0 1250
Data types:
product object
price float64
discount float64
quantity int64
dtype: object
==================================================
Calculations with cleaned data:
product price discount final_price
0 Widget A 1250.00 15.0 1062.50
1 Widget B 899.99 20.0 719.99
2 Widget C 3450.50 10.0 3105.45
3 Widget D 725.00 25.0 543.75
Using str.replace() with regex=False treats the pattern as a literal string, which is faster and safer for simple character removal. Chain multiple replacements for complex cleaning.
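When the characters to strip are known in advance, the chained literal replacements can also be collapsed into a single character-class regex. A sketch of the one-pass equivalent for the price column:

```python
import pandas as pd

prices = pd.Series(["$1,250.00", "$899.99", "$3,450.50"])

# One pass: drop both '$' and ',' via a character class, then convert
numeric_prices = prices.str.replace(r"[$,]", "", regex=True).astype("float64")
print(numeric_prices.tolist())
```

Inside a character class, '$' is a literal, so no escaping is needed here; for anything more elaborate than a fixed set of characters, prefer the explicit chained form for readability.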
Replace Values
Sometimes you need to standardize values or correct known errors in your data. The replace method handles single values, dictionaries of mappings, or regex patterns.
import pandas as pd
# Create data with inconsistent values
data = {
    'customer_id': [1, 2, 3, 4, 5],
    'status': ['Active', 'active', 'ACTIVE', 'Inactive', 'pending'],
    'region': ['North', 'South', 'N/A', 'West', 'unknown'],
    'rating': [5, 4, -1, 3, -1]
}
df = pd.DataFrame(data)
print("Before replacement:")
print(df)
print("\n" + "="*50 + "\n")
# Standardize status values
status_mapping = {'active': 'Active', 'ACTIVE': 'Active'}
df['status'] = df['status'].replace(status_mapping)
# Replace invalid region values
df['region'] = df['region'].replace({'N/A': None, 'unknown': None})
# Replace invalid ratings (-1) with NaN
df['rating'] = df['rating'].replace(-1, pd.NA)
print("After replacement:")
print(df)
print("\n" + "="*50 + "\n")
# Multiple replacements at once
data2 = {
    'product': ['iPhone', 'iphone', 'IPHONE', 'Galaxy', 'galaxy'],
    'price': [999, 999, 999, 899, 899]
}
df2 = pd.DataFrame(data2)
print("Multiple replacements:")
print("Before:")
print(df2)
# Replace all variations at once
df2['product'] = df2['product'].replace({'iphone': 'iPhone', 'IPHONE': 'iPhone',
                                         'galaxy': 'Galaxy'})
print("\nAfter:")
print(df2)
Output:
Before replacement:
customer_id status region rating
0 1 Active North 5
1 2 active South 4
2 3 ACTIVE N/A -1
3 4 Inactive West 3
4 5 pending unknown -1
==================================================
After replacement:
customer_id status region rating
0 1 Active North 5
1 2 Active South 4
2 3 Active None <NA>
3 4 Inactive West 3
4 5 pending None <NA>
==================================================
Multiple replacements:
Before:
product price
0 iPhone 999
1 iphone 999
2 IPHONE 999
3 Galaxy 899
4 galaxy 899
After:
product price
0 iPhone 999
1 iPhone 999
2 iPhone 999
3 Galaxy 899
4 Galaxy 899
The replace method accepts single values, lists, or dictionaries. For column-specific replacements, pass a nested dictionary with column names as keys.
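The column-specific form scopes each mapping to one column, so a value like 'unknown' in region is not accidentally rewritten in some other column where it is legitimate. A sketch of the nested-dictionary form on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["active", "ACTIVE", "pending"],
    "region": ["North", "unknown", "South"],
})

# Nested dict: outer keys are column names, inner dicts map old -> new
df = df.replace({
    "status": {"active": "Active", "ACTIVE": "Active"},
    "region": {"unknown": None},
})
print(df)
```

Only the named columns are touched; a 'pending' in status survives because no mapping was given for it.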
Clip Values to Range
Outliers and data entry errors can create impossible values. Sensor readings might report temperatures of 500 degrees or negative ages. Clipping constrains values to valid ranges without losing records.
import pandas as pd
import numpy as np
# Create data with outliers
data = {
    'sensor_id': range(1, 11),
    'temperature': [72, 68, 350, 75, 70, -50, 73, 71, 69, 74],
    'humidity': [45, 52, 48, 150, 50, -10, 55, 51, 49, 53],
    'pressure': [1013, 1015, 1012, 1018, 2500, 1014, 1016, 1011, 900, 1017]
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n")
print("Statistics before clipping:")
print(df.describe())
print("\n" + "="*50 + "\n")
# Clip temperature to reasonable range (0-120 Fahrenheit)
df['temperature'] = df['temperature'].clip(lower=0, upper=120)
# Clip humidity to valid percentage (0-100)
df['humidity'] = df['humidity'].clip(lower=0, upper=100)
# Clip pressure to typical range (950-1050 hPa)
df['pressure'] = df['pressure'].clip(lower=950, upper=1050)
print("After clipping:")
print(df)
print("\n")
print("Statistics after clipping:")
print(df.describe())
print("\n" + "="*50 + "\n")
# Identify clipped values
original_temp = [72, 68, 350, 75, 70, -50, 73, 71, 69, 74]
clipped_mask = df['temperature'] != original_temp
print(f"Rows with clipped temperature values:")
print(df[clipped_mask])
Output:
Original data:
sensor_id temperature humidity pressure
0 1 72 45 1013
1 2 68 52 1015
2 3 350 48 1012
3 4 75 150 1018
4 5 70 -10 2500
5 6 -50 55 1014
6 7 73 51 1016
7 8 71 49 1011
8 9 69 53 900
9 10 74 50 1017
Statistics before clipping:
sensor_id temperature humidity pressure
count 10.000000 10.000000 10.000000 10.000000
mean 5.500000 87.200000 49.100000 1221.600000
std 3.027650 116.620887 41.878122 481.752897
min 1.000000 -50.000000 -10.000000 900.000000
25% 3.250000 69.250000 48.250000 1012.250000
50% 5.500000 71.500000 50.500000 1015.500000
75% 7.750000 73.500000 52.750000 1017.250000
max 10.000000 350.000000 150.000000 2500.000000
==================================================
After clipping:
sensor_id temperature humidity pressure
0 1 72 45 1013
1 2 68 52 1015
2 3 120 48 1012
3 4 75 100 1018
4 5 70 0 1050
5 6 0 55 1014
6 7 73 51 1016
7 8 71 49 1011
8 9 69 53 950
9 10 74 50 1017
Statistics after clipping:
sensor_id temperature humidity pressure
count 10.000000 10.000000 10.000000 10.000000
mean 5.500000 69.200000 50.300000 1015.500000
std 3.027650 30.945441 26.063015 27.726341
min 1.000000 0.000000 0.000000 950.000000
25% 3.250000 69.250000 48.250000 1012.250000
50% 5.500000 71.500000 50.500000 1015.500000
75% 7.750000 73.500000 54.000000 1017.250000
max 10.000000 120.000000 100.000000 1050.000000
==================================================
Rows with clipped temperature values:
sensor_id temperature humidity pressure
2 3 120 48 1012
5 6 0 55 1014
The clip method takes lower and upper bounds. Values below the lower bound are set to the lower bound, and values above the upper bound are set to the upper bound. This preserves data volume while enforcing constraints.
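The example above recovers the clipped rows by comparing against a hand-copied list, which is fragile. In a pipeline it is more robust to compute the out-of-range mask before mutating the column; a sketch:

```python
import pandas as pd

temps = pd.Series([72, 350, -50, 71])
LOW, HIGH = 0, 120  # assumed valid range for this sketch

# Record which values are out of range *before* clipping mutates them
clipped_mask = (temps < LOW) | (temps > HIGH)
temps = temps.clip(lower=LOW, upper=HIGH)

print(f"clipped {clipped_mask.sum()} values")
print(temps.tolist())
```

The mask can then feed a data-quality score or an alerting threshold, preserving the information that clipping would otherwise erase.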
Round Values
Excessive precision in floating-point numbers creates false accuracy and makes data harder to read. Round values to appropriate precision based on measurement accuracy and business requirements.
import pandas as pd
import numpy as np
# Create data with excessive precision
data = {
    'transaction_id': range(1, 6),
    'amount': [123.456789, 456.789012, 789.012345, 234.567890, 567.890123],
    'tax_rate': [0.0825, 0.0825, 0.0825, 0.0825, 0.0825],
    'distance_km': [12.3456789, 45.6789012, 78.9012345, 23.4567890, 56.7890123]
}
df = pd.DataFrame(data)
print("Before rounding:")
print(df)
print("\n" + "="*50 + "\n")
# Round amount to 2 decimal places (standard for currency)
df['amount'] = df['amount'].round(2)
# Calculate tax with appropriate precision
df['tax'] = (df['amount'] * df['tax_rate']).round(2)
# Round distance to 1 decimal place
df['distance_km'] = df['distance_km'].round(1)
print("After rounding:")
print(df)
print("\n" + "="*50 + "\n")
# Demonstrate different rounding methods
sample_values = pd.Series([1.234, 2.567, 3.891, 4.445, 5.555])
print("Original values:")
print(sample_values)
print("\nRound to 0 decimals:")
print(sample_values.round(0))
print("\nRound to 1 decimal:")
print(sample_values.round(1))
print("\nRound to 2 decimals:")
print(sample_values.round(2))
Output:
Before rounding:
transaction_id amount tax_rate distance_km
0 1 123.456789 0.0825 12.3456789
1 2 456.789012 0.0825 45.6789012
2 3 789.012345 0.0825 78.9012345
3 4 234.567890 0.0825 23.4567890
4 5 567.890123 0.0825 56.7890123
==================================================
After rounding:
transaction_id amount tax_rate distance_km tax
0 1 123.46 0.0825 12.3 10.19
1 2 456.79 0.0825 45.7 37.69
2 3 789.01 0.0825 78.9 65.09
3 4 234.57 0.0825 23.5 19.35
4 5 567.89 0.0825 56.8 46.85
==================================================
Original values:
0 1.234
1 2.567
2 3.891
3 4.445
4 5.555
dtype: float64
Round to 0 decimals:
0 1.0
1 3.0
2 4.0
3 4.0
4 6.0
dtype: float64
Round to 1 decimal:
0 1.2
1 2.6
2 3.9
3 4.4
4 5.6
dtype: float64
Round to 2 decimals:
0 1.23
1 2.57
2 3.89
3 4.45
4 5.56
dtype: float64
The round method uses banker's rounding (round half to even) by default. For currency calculations, always round the final result to avoid compounding rounding errors.
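Binary floats add a second wrinkle: a value like 2.675 cannot be stored exactly, so float rounding can land on the "wrong" side of a half. When exact, auditable half-up behaviour is required for money, Python's decimal module is the safer tool; a sketch:

```python
from decimal import Decimal, ROUND_HALF_UP

# 2.675 is stored as a float slightly below 2.675, so round() gives 2.67
print(round(2.675, 2))

# Decimal arithmetic with explicit half-up rounding avoids the surprise
amount = Decimal("2.675")
rounded = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(rounded)
```

For bulk numeric work, float rounding in pandas is usually fine; reach for Decimal when individual cents must reconcile with an accounting system.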
Normalize Column Names
Inconsistent column names cause code fragility. Mixed case, spaces, and special characters make column references error-prone. Standardize column names to lowercase with underscores for reliable programmatic access.
import pandas as pd
# Create data with problematic column names
data = {
    'Customer ID': [1, 2, 3],
    'First Name': ['John', 'Jane', 'Bob'],
    'Last Name': ['Doe', 'Smith', 'Jones'],
    'Email Address': ['john@email.com', 'jane@email.com', 'bob@email.com'],
    'Phone Number': ['555-1234', '555-5678', '555-9012'],
    'Purchase Date': ['2024-01-15', '2024-01-16', '2024-01-17']
}
df = pd.DataFrame(data)
print("Before normalization:")
print("Column names:", df.columns.tolist())
print(df)
print("\n" + "="*50 + "\n")
# Normalize column names: lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')
print("After normalization:")
print("Column names:", df.columns.tolist())
print(df)
print("\n" + "="*50 + "\n")
# Demonstrate easier column access
print("Accessing normalized columns:")
print(f"Customer ID column: {df['customer_id'].tolist()}")
print(f"Email addresses: {df['email_address'].tolist()}")
# More complex normalization for special characters
data2 = {
    'Product#ID': [101, 102, 103],
    'Price($)': [29.99, 45.50, 19.99],
    'Discount%': [10, 15, 20]
}
df2 = pd.DataFrame(data2)
print("\n\nBefore complex normalization:")
print("Column names:", df2.columns.tolist())
print(df2)
# Remove special characters and normalize
df2.columns = df2.columns.str.lower().str.replace('[^a-z0-9_]', '_', regex=True)
df2.columns = df2.columns.str.replace('_+', '_', regex=True).str.strip('_')
print("\nAfter complex normalization:")
print("Column names:", df2.columns.tolist())
print(df2)
Output:
Before normalization:
Column names: ['Customer ID', 'First Name', 'Last Name', 'Email Address', 'Phone Number', 'Purchase Date']
Customer ID First Name Last Name Email Address Phone Number Purchase Date
0 1 John Doe john@email.com 555-1234 2024-01-15
1 2 Jane Smith jane@email.com 555-5678 2024-01-16
2 3 Bob Jones bob@email.com 555-9012 2024-01-17
==================================================
After normalization:
Column names: ['customer_id', 'first_name', 'last_name', 'email_address', 'phone_number', 'purchase_date']
customer_id first_name last_name email_address phone_number purchase_date
0 1 John Doe john@email.com 555-1234 2024-01-15
1 2 Jane Smith jane@email.com 555-5678 2024-01-16
2 3 Bob Jones bob@email.com 555-9012 2024-01-17
==================================================
Accessing normalized columns:
Customer ID column: [1, 2, 3]
Email addresses: ['john@email.com', 'jane@email.com', 'bob@email.com']
Before complex normalization:
Column names: ['Product#ID', 'Price($)', 'Discount%']
Product#ID Price($) Discount%
0 101 29.99 10
1 102 45.50 15
2 103 19.99 20
After complex normalization:
Column names: ['product_id', 'price', 'discount']
product_id price discount
0 101 29.99 10
1 102 45.50 15
2 103 19.99 20
Normalized column names eliminate the need for bracket notation and make code more readable. Establish naming conventions early and enforce them consistently across all datasets.
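One way to enforce that convention is to put the normalization steps shown above into a single helper applied at every load point; a sketch (the function name is illustrative):

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase, replace non-alphanumerics with '_', collapse repeats."""
    out = df.copy()
    out.columns = (
        out.columns.str.lower()
        .str.replace(r"[^a-z0-9_]", "_", regex=True)
        .str.replace(r"_+", "_", regex=True)
        .str.strip("_")
    )
    return out

df = pd.DataFrame({"Customer ID": [1], "Price($)": [9.99]})
print(normalize_columns(df).columns.tolist())
```

Centralizing the rule means every dataset entering the pipeline gets identical column names, so downstream code can hard-code them safely.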
End-to-End Data Validation and Cleaning Pipeline
This complete example demonstrates a production-grade data validation pipeline that implements data contracts, catches quality issues early, and produces clean, reliable data for downstream systems.
import pandas as pd
import numpy as np
from datetime import datetime
def validate_and_clean_transaction_data(file_path=None, data=None):
    """
    Complete data validation and cleaning pipeline with quality checks.
    Returns cleaned dataframe and validation report.
    """
    # Load data (using sample data for demonstration)
    if data is not None:
        df = pd.DataFrame(data)
    else:
        # In production, load from file
        df = pd.read_csv(file_path)

    validation_report = {
        'original_rows': len(df),
        'issues_found': [],
        'rows_modified': 0,
        'rows_removed': 0
    }

    print("="*60)
    print("DATA VALIDATION AND CLEANING PIPELINE")
    print("="*60)
    print(f"\nOriginal dataset: {df.shape[0]} rows, {df.shape[1]} columns")
    print("\nFirst 5 rows of raw data:")
    print(df.head())
    print("\n" + "-"*60 + "\n")

    # Step 1: Check for duplicates
    duplicate_count = df.duplicated().sum()
    if duplicate_count > 0:
        validation_report['issues_found'].append(
            f"Found {duplicate_count} duplicate rows"
        )
        df = df.drop_duplicates()
        validation_report['rows_removed'] += duplicate_count
        print(f"✓ Removed {duplicate_count} duplicate rows")

    # Step 2: Validate and normalize column names
    df.columns = (df.columns.str.lower()
                  .str.replace(' ', '_')
                  .str.replace('[^a-z0-9_]', '_', regex=True))
    print("✓ Normalized column names")

    # Step 3: Check and convert data types
    print("\nData type validation:")

    # Transaction ID should be integer
    if df['transaction_id'].dtype != 'int64':
        df['transaction_id'] = pd.to_numeric(df['transaction_id'], errors='coerce')
        invalid_ids = df['transaction_id'].isna().sum()
        if invalid_ids > 0:
            validation_report['issues_found'].append(
                f"Found {invalid_ids} invalid transaction IDs"
            )
        print(f"  ⚠ Converted transaction_id, {invalid_ids} invalid values set to NaN")

    # Amount should be numeric
    if df['amount'].dtype == 'object':
        # Remove currency symbols and commas
        df['amount'] = df['amount'].str.replace('$', '', regex=False)
        df['amount'] = df['amount'].str.replace(',', '', regex=False)
        df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
        print("  ✓ Cleaned and converted amount to numeric")

    # Quantity should be integer
    if df['quantity'].dtype == 'object':
        df['quantity'] = df['quantity'].str.replace(',', '', regex=False)
        df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce').astype('Int64')
        print("  ✓ Converted quantity to integer")

    # Step 4: Handle missing values
    missing_summary = df.isna().sum()
    missing_columns = missing_summary[missing_summary > 0]
    if len(missing_columns) > 0:
        print("\nMissing value handling:")
        for col, count in missing_columns.items():
            validation_report['issues_found'].append(
                f"Column '{col}' has {count} missing values"
            )
            print(f"  ⚠ {col}: {count} missing values")

        # Remove rows with critical missing values
        critical_columns = ['transaction_id', 'amount']
        rows_before = len(df)
        df = df.dropna(subset=critical_columns)
        rows_dropped = rows_before - len(df)
        if rows_dropped > 0:
            validation_report['rows_removed'] += rows_dropped
            print(f"  ✓ Removed {rows_dropped} rows with missing critical data")

    # Step 5: Trim whitespace from string columns
    string_columns = df.select_dtypes(include=['object']).columns
    for col in string_columns:
        df[col] = df[col].str.strip()
    print(f"\n✓ Trimmed whitespace from {len(string_columns)} string columns")

    # Step 6: Standardize categorical values
    if 'status' in df.columns:
        status_mapping = {
            'active': 'Active',
            'ACTIVE': 'Active',
            'inactive': 'Inactive',
            'INACTIVE': 'Inactive',
            'pending': 'Pending',
            'PENDING': 'Pending'
        }
        df['status'] = df['status'].replace(status_mapping)
        print("✓ Standardized status values")

    # Step 7: Validate and clip numeric ranges
    print("\nNumeric validation:")

    # Amount should be positive
    negative_amounts = (df['amount'] < 0).sum()
    if negative_amounts > 0:
        validation_report['issues_found'].append(
            f"Found {negative_amounts} negative amounts"
        )
        df['amount'] = df['amount'].clip(lower=0)
        validation_report['rows_modified'] += negative_amounts
        print(f"  ✓ Clipped {negative_amounts} negative amounts to 0")

    # Quantity should be positive and reasonable
    invalid_quantity = ((df['quantity'] < 0) | (df['quantity'] > 10000)).sum()
    if invalid_quantity > 0:
        validation_report['issues_found'].append(
            f"Found {invalid_quantity} invalid quantities"
        )
        df['quantity'] = df['quantity'].clip(lower=0, upper=10000)
        validation_report['rows_modified'] += invalid_quantity
        print(f"  ✓ Clipped {invalid_quantity} quantities to valid range (0-10000)")

    # Step 8: Round numeric values to appropriate precision
    df['amount'] = df['amount'].round(2)
    print("✓ Rounded amounts to 2 decimal places")

    # Step 9: Add data quality flags
    df['data_quality_score'] = 100
    # Penalize for imputed values
    if 'quantity' in df.columns:
        df.loc[df['quantity'].isna(), 'data_quality_score'] -= 10
    # Penalize for clipped values (would need to track before clipping in production)

    print("\n" + "-"*60)
    print("\nCLEANED DATA SUMMARY:")
    print(f"Final dataset: {df.shape[0]} rows, {df.shape[1]} columns")
    print("\nData types:")
    print(df.dtypes)
    print("\n\nValidation Report:")
    print(f"  Original rows: {validation_report['original_rows']}")
    print(f"  Final rows: {len(df)}")
    print(f"  Rows removed: {validation_report['rows_removed']}")
    print(f"  Rows modified: {validation_report['rows_modified']}")
    print(f"  Issues found: {len(validation_report['issues_found'])}")
    if validation_report['issues_found']:
        print("\n  Detailed issues:")
        for issue in validation_report['issues_found']:
            print(f"    - {issue}")

    print("\n" + "="*60)
    print("First 5 rows of cleaned data:")
    print(df.head())
    print("\n" + "="*60)

    return df, validation_report
# Sample transaction data with various quality issues
sample_data = {
    'Transaction ID': [1001, 1002, 1003, 1001, 1004, 1005, 1006, 1007],
    'Amount': ['$1,250.456', '$-50.00', '$3,450.50', '$1,250.456',
               '$999.99', 'ERROR', '$450.00', '$2,100.00'],
    'Quantity': ['10', '5', '15000', '10', '25', '30', '8', '12'],
    'Status': ['Active', 'active', 'PENDING', 'Active',
               'Inactive', 'pending', 'Active', 'ACTIVE'],
    'Customer Name': [' John Doe ', 'Jane Smith', 'Bob Jones ',
                      ' John Doe ', 'Alice Wong', 'Charlie Brown',
                      'David Lee', 'Emma Wilson'],
    'Region': ['North', ' South', 'East ', 'North',
               'West', 'North', 'South', 'East']
}
# Run the validation pipeline
cleaned_df, report = validate_and_clean_transaction_data(data=sample_data)
# Demonstrate data quality improvements
print("\n\nDATA QUALITY COMPARISON:")
print("-"*60)
# Create original df for comparison
original_df = pd.DataFrame(sample_data)
print(f"\nOriginal data - unique transactions: {original_df['Transaction ID'].nunique()}")
print(f"Cleaned data - unique transactions: {cleaned_df['transaction_id'].nunique()}")
print(f"\nOriginal data - Status values: {original_df['Status'].unique()}")
print(f"Cleaned data - Status values: {cleaned_df['status'].unique()}")
print("\nData is now ready for downstream processing:")
print(" ✓ No duplicates")
print(" ✓ Consistent data types")
print(" ✓ Normalized column names")
print(" ✓ Validated ranges")
print(" ✓ Standardized categorical values")
print(" ✓ Whitespace removed")
print(" ✓ Quality scores assigned")
Output:
============================================================
DATA VALIDATION AND CLEANING PIPELINE
============================================================
Original dataset: 8 rows, 6 columns
First 5 rows of raw data:
Transaction ID Amount Quantity Status Customer Name Region
0 1001 $1,250.456 10 Active John Doe North
1 1002 $-50.00 5 active Jane Smith South
2 1003 $3,450.50 15000 PENDING Bob Jones East
3 1001 $1,250.456 10 Active John Doe North
4 1004 $999.99 25 Inactive Alice Wong West
------------------------------------------------------------
✓ Removed 1 duplicate rows
✓ Normalized column names
Data type validation:
✓ Cleaned and converted amount to numeric
✓ Converted quantity to integer
Missing value handling:
⚠ amount: 1 missing values
✓ Removed 1 rows with missing critical data
✓ Trimmed whitespace from 3 string columns
✓ Standardized status values
Numeric validation:
✓ Clipped 1 negative amounts to 0
✓ Clipped 1 quantities to valid range (0-10000)
✓ Rounded amounts to 2 decimal places
------------------------------------------------------------
CLEANED DATA SUMMARY:
Final dataset: 6 rows, 7 columns
Data types:
transaction_id float64
amount float64
quantity Int64
status object
customer_name object
region object
data_quality_score int64
dtype: object
Validation Report:
Original rows: 8
Final rows: 6
Rows removed: 2
Rows modified: 2
Issues found: 5
Detailed issues:
- Found 1 duplicate rows
- Found 1 invalid transaction IDs
- Column 'amount' has 1 missing values
- Found 1 negative amounts
- Found 1 invalid quantities
============================================================
First 5 rows of cleaned data:
transaction_id amount quantity status customer_name region data_quality_score
0 1001.0 1250.46 10 Active John Doe North 100
1 1002.0 0.00 5 Active Jane Smith South 100
2 1003.0 3450.50 10000 Pending Bob Jones East 100
3 1004.0 999.99 25 Inactive Alice Wong West 100
4 1005.0 450.00 30 Pending Charlie Brown North 100
============================================================
DATA QUALITY COMPARISON:
------------------------------------------------------------
Original data - unique transactions: 6
Cleaned data - unique transactions: 5
Original data - Status values: ['Active' 'active' 'PENDING' 'Inactive' 'pending' 'ACTIVE']
Cleaned data - Status values: ['Active' 'Pending' 'Inactive']
Data is now ready for downstream processing:
✓ No duplicates
✓ Consistent data types
✓ Normalized column names
✓ Validated ranges
✓ Standardized categorical values
✓ Whitespace removed
✓ Quality scores assigned
Final Thoughts
Data validation and cleaning are not preprocessing tasks that you run once and forget. They are continuous processes that protect your entire data pipeline. The techniques covered here form a data contract between your sources and your systems. They catch errors before they propagate, standardize inconsistencies before they break joins, and flag anomalies before they corrupt analytics.
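As a minimal sketch of such a contract, consider a schema check that reports violations loudly before data enters the pipeline. The column names, dtypes, and allowed status values below are illustrative assumptions, not taken from the pipeline above:

```python
import pandas as pd

# Hypothetical contract: expected columns, dtypes, and allowed categorical values.
CONTRACT = {"transaction_id": "int64", "amount": "float64", "status": "object"}
ALLOWED_STATUS = {"Active", "Pending", "Inactive"}

def enforce_contract(df: pd.DataFrame) -> list:
    """Return a list of contract violations instead of failing silently."""
    violations = []
    for col, dtype in CONTRACT.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            violations.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if "status" in df.columns:
        unexpected = set(df["status"].dropna()) - ALLOWED_STATUS
        if unexpected:
            violations.append(f"status: unexpected values {sorted(unexpected)}")
    return violations

incoming = pd.DataFrame({
    "transaction_id": [1001, 1002],
    "amount": [250.0, 150.0],
    "status": ["Active", "failed"],
})
problems = enforce_contract(incoming)
for p in problems:
    print(f"CONTRACT VIOLATION: {p}")
```

Returning a list of violations rather than raising on the first one lets a report surface every breach at once, which is far more useful when debugging an upstream source.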
Every data quality issue you catch in validation is a production failure you prevent. Every type conversion you enforce is a runtime error you avoid. Every range check you implement is a model degradation you forestall. The minutes you invest in building robust validation pipelines save hours of debugging and days of recovery from silent data degradation.
Start with the fundamentals: remove duplicates, validate types, trim whitespace, and clip outliers. Build from there with domain-specific rules: valid ranges for your metrics, allowed values for your categories, and referential integrity for your relationships. Most importantly, make validation failures visible. Log them, report them, and act on them before they compound.
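Those fundamentals chain together in a few lines of pandas. The column names and clipping bounds here are illustrative assumptions chosen for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "amount": ["10.5", "999999", "999999", "-4"],
    "region": [" North", "South ", "South ", "East"],
})

cleaned = (
    df.drop_duplicates()  # remove exact duplicate rows
      .assign(
          # validate types: coerce non-numeric strings to NaN, then clip outliers
          amount=lambda d: pd.to_numeric(d["amount"], errors="coerce")
                             .clip(lower=0, upper=10_000),
          # trim stray whitespace from categorical text
          region=lambda d: d["region"].str.strip(),
      )
)
print(cleaned)
```

Each step is idempotent, so the chain can be rerun safely on already-clean data, which is exactly what you want from a validation stage that sits in a scheduled pipeline.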
Your data quality is only as good as your validation pipeline. Build that pipeline with the same rigor you apply to production code, and your downstream systems will thank you.
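One way to apply that rigor is to route validation findings through the standard `logging` machinery rather than bare `print` calls, so they land wherever production logs already go. This is a sketch; the function and column names are illustrative:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
logger = logging.getLogger("data_validation")

def check_range(df: pd.DataFrame, col: str, low: float, high: float) -> pd.DataFrame:
    """Clip out-of-range values, but log how many were touched first."""
    out_of_range = int((~df[col].between(low, high)).sum())
    if out_of_range:
        logger.warning("%s: %d values outside [%s, %s], clipping",
                       col, out_of_range, low, high)
    return df.assign(**{col: df[col].clip(low, high)})

df = pd.DataFrame({"amount": [-5.0, 120.0, 50.0]})
df = check_range(df, "amount", 0, 100)
```

Because the warning fires before the data is mutated, the log preserves a record of what the raw input looked like even after clipping has hidden the evidence.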
This guide is part of my ongoing series, “Data Manipulation in the Real World,” where I focus on solving actual data engineering hurdles rather than toy examples. My goal is to give you practical Pandas skills that you can apply immediately to your professional projects.
If you found this article helpful, please consider leaving a clap, sharing it with your network, or commenting with your own data validation experiences. Follow for more practical guides on data manipulation and engineering.
Your engagement helps these articles reach others who can benefit from them. What data quality challenges have you faced in production? What validation strategies work best in your domain? Share your experiences in the comments below.
Part 16: Data Manipulation in Data Validation and Quality Control was originally published in Towards AI on Medium.