How Data Contracts Prevent Silent Degradation in Production Systems

Data quality issues are the silent killers of production systems. A single malformed record can crash your pipeline. A gradual drift in data distributions can slowly degrade model performance. Missing values that sneak through validation can corrupt downstream analytics. The cost of poor data quality is measured not just in failed jobs, but in wrong business decisions, customer frustration, and lost revenue.
Data validation and cleaning are not optional preprocessing steps. They are your first line of defense against data degradation. This article explores practical techniques for ensuring data quality through validation rules, type enforcement, and systematic cleaning operations. We will look at how to catch issues early, handle them gracefully, and build data contracts that prevent silent failures.
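Before diving into individual cleaning techniques, it helps to see what a data contract looks like in its simplest form: an explicit schema checked at the pipeline boundary. The sketch below is a minimal, hypothetical example (the `CONTRACT` dictionary and `enforce_contract` helper are illustrative names, not a standard API); it reports violations instead of letting bad data pass silently.

```python
import pandas as pd

# A minimal, hypothetical data contract: expected column names and dtypes
CONTRACT = {
    "customer_id": "int64",
    "amount": "float64",
    "status": "object",
}

def enforce_contract(df: pd.DataFrame, contract: dict) -> list:
    """Return a list of violations instead of failing silently."""
    violations = []
    for column, expected_dtype in contract.items():
        if column not in df.columns:
            violations.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            violations.append(
                f"{column}: expected {expected_dtype}, got {df[column].dtype}"
            )
    return violations

df = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": ["10.5", "20.0"],  # wrong type: strings, not floats
    "status": ["ok", "ok"],
})
print(enforce_contract(df, CONTRACT))
```

Every technique in this article is, in effect, a way of bringing data into compliance with a contract like this one.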
Remove Duplicates
Duplicate records inflate dataset sizes, skew statistics, and create incorrect aggregations. Removing duplicates is often the first step in data cleaning, but doing it correctly requires understanding which columns define uniqueness and which duplicate to keep.
import pandas as pd
import numpy as np
# Create sample data with duplicates
data = {
    'customer_id': [101, 102, 103, 101, 104, 102],
    'transaction_date': ['2024-01-15', '2024-01-16', '2024-01-17',
                         '2024-01-15', '2024-01-18', '2024-01-19'],
    'amount': [250.0, 150.0, 300.0, 250.0, 400.0, 150.0],
    'status': ['completed', 'completed', 'pending',
               'completed', 'completed', 'failed']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print(f"\nShape: {df.shape}")
# Remove complete duplicates (all columns match)
df_no_dup = df.drop_duplicates()
print("\n\nAfter removing complete duplicates:")
print(df_no_dup)
print(f"Shape: {df_no_dup.shape}")
# Remove duplicates based on specific column (customer_id)
df_unique_customers = df.drop_duplicates(subset=['customer_id'], keep='first')
print("\n\nKeeping first occurrence per customer:")
print(df_unique_customers)
# Keep last occurrence instead
df_last = df.drop_duplicates(subset=['customer_id'], keep='last')
print("\n\nKeeping last occurrence per customer:")
print(df_last)
Output:
Original data:
customer_id transaction_date amount status
0 101 2024-01-15 250.0 completed
1 102 2024-01-16 150.0 completed
2 103 2024-01-17 300.0 pending
3 101 2024-01-15 250.0 completed
4 104 2024-01-18 400.0 completed
5 102 2024-01-19 150.0 failed
Shape: (6, 4)
After removing complete duplicates:
customer_id transaction_date amount status
0 101 2024-01-15 250.0 completed
1 102 2024-01-16 150.0 completed
2 103 2024-01-17 300.0 pending
4 104 2024-01-18 400.0 completed
5 102 2024-01-19 150.0 failed
Shape: (5, 4)
Keeping first occurrence per customer:
customer_id transaction_date amount status
0 101 2024-01-15 250.0 completed
1 102 2024-01-16 150.0 completed
2 103 2024-01-17 300.0 pending
4 104 2024-01-18 400.0 completed
Keeping last occurrence per customer:
customer_id transaction_date amount status
2 103 2024-01-17 300.0 pending
3 101 2024-01-15 250.0 completed
4 104 2024-01-18 400.0 completed
5 102 2024-01-19 150.0 failed
The subset parameter defines uniqueness criteria. The keep parameter controls which duplicate to retain: 'first', 'last', or False to remove all duplicates.
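When any duplication signals an upstream bug, keep=False is the stricter choice: it drops every copy rather than keeping one. A small sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 101],
    "amount": [250.0, 150.0, 300.0, 250.0],
})

# keep=False removes every row involved in a duplicate, not just the extras
df_strict = df.drop_duplicates(subset=["customer_id"], keep=False)
print(df_strict)
```

Customer 101 appears twice, so both of its rows are removed; only 102 and 103 survive.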
Check Data Types
Type mismatches cause runtime errors and incorrect calculations. Before performing any analysis, verify that each column contains the expected data type. This prevents treating numeric strings as numbers or attempting mathematical operations on text fields.
import pandas as pd
# Create data with mixed types
data = {
    'user_id': [1, 2, 3, 4, 5],
    'age': ['25', '30', 'unknown', '45', '28'],
    'salary': [50000, 65000, 70000, 80000, 55000],
    'join_date': ['2020-01-15', '2019-03-22', '2021-07-10',
                  '2018-11-05', '2022-02-18'],
    'active': [True, False, True, True, False]
}
df = pd.DataFrame(data)
# Check current data types
print("Current data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
# Detailed type information
print("Detailed information:")
print(df.info())
print("\n" + "="*50 + "\n")
# Check for specific type
print("Checking if 'salary' is numeric:")
print(pd.api.types.is_numeric_dtype(df['salary']))
print("\nChecking if 'age' is numeric:")
print(pd.api.types.is_numeric_dtype(df['age']))
# Identify columns with object type (potential mixed types)
object_columns = df.select_dtypes(include=['object']).columns
print(f"\nColumns with object type: {list(object_columns)}")
Output:
Current data types:
user_id int64
age object
salary int64
join_date object
active bool
dtype: object
==================================================
Detailed information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 5 non-null int64
1 age 5 non-null object
2 salary 5 non-null int64
3 join_date 5 non-null object
4 active 5 non-null bool
dtypes: bool(1), int64(2), object(2)
memory usage: 297.0+ bytes
None
==================================================
Checking if 'salary' is numeric:
True
Checking if 'age' is numeric:
False
Columns with object type: Index(['age', 'join_date'], dtype='object')
The dtypes attribute reveals the current type of each column. Object type often indicates string data or mixed types that need conversion.
Convert Data Types
Once you identify type mismatches, convert columns to their correct types. Type conversion must handle invalid values gracefully to avoid crashing on malformed data.
import pandas as pd
import numpy as np
# Create sample data
data = {
    'product_id': ['1001', '1002', '1003', '1004'],
    'price': ['29.99', '45.50', '19.99', '99.99'],
    'quantity': ['10', '25', '30', '15'],
    'rating': ['4.5', '3.8', '4.9', '4.2'],
    'in_stock': ['true', 'false', 'true', 'true']
}
df = pd.DataFrame(data)
print("Before conversion:")
print(df.dtypes)
print("\n")
print(df)
print("\n" + "="*50 + "\n")
# Convert to appropriate types
df['product_id'] = df['product_id'].astype('int64')
df['price'] = df['price'].astype('float64')
df['quantity'] = df['quantity'].astype('int32')
df['rating'] = df['rating'].astype('float32')
df['in_stock'] = df['in_stock'].map({'true': True, 'false': False})
print("After conversion:")
print(df.dtypes)
print("\n")
print(df)
Output:
Before conversion:
product_id object
price object
quantity object
rating object
in_stock object
dtype: object
product_id price quantity rating in_stock
0 1001 29.99 10 4.5 true
1 1002 45.50 25 3.8 false
2 1003 19.99 30 4.9 true
3 1004 99.99 15 4.2 true
==================================================
After conversion:
product_id int64
price float64
quantity int32
rating float32
in_stock bool
dtype: object
product_id price quantity rating in_stock
0 1001 29.99 10 4.5 True
1 1002 45.50 25 3.8 False
2 1003 19.99 30 4.9 True
3 1004 99.99 15 4.2 True
The astype method converts column types explicitly. For boolean conversions, map provides better control over string to boolean mapping than astype.
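Two caveats worth knowing before relying on astype: it raises a ValueError as soon as it meets an unparseable value, and plain int64 cannot hold missing values. The nullable 'Int64' dtype (capital I) covers the second case. A brief sketch:

```python
import pandas as pd

s = pd.Series(["10", "20", "abc"])

# astype fails fast on the first unparseable value
try:
    s.astype("int64")
except ValueError as exc:
    print(f"astype raised: {exc}")

# Nullable Int64 keeps integer semantics even with missing entries
clean = pd.Series([1, 2, None], dtype="Int64")
print(clean.dtype)
```

Fail-fast behavior is a feature when the column is contractually required to be clean; use pd.to_numeric with errors='coerce' (covered next) when you expect dirty values.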
Handle Mixed Types
Real-world data often contains mixed types in a single column. A numeric column might contain error codes as strings, or a date column might have 'N/A' entries. The to_numeric function handles these scenarios with an errors parameter.
import pandas as pd
import numpy as np
# Create data with mixed types
data = {
    'sensor_id': [1, 2, 3, 4, 5],
    'temperature': ['72.5', '68.3', 'ERROR', '75.1', '70.0'],
    'pressure': ['1013', '1015', '1012', 'null', '1014'],
    'humidity': ['45', '52', '48', '50', 'N/A']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")
# Attempt conversion with errors='coerce' (invalid values become NaN)
df['temperature'] = pd.to_numeric(df['temperature'], errors='coerce')
df['pressure'] = pd.to_numeric(df['pressure'], errors='coerce')
df['humidity'] = pd.to_numeric(df['humidity'], errors='coerce')
print("After conversion (errors='coerce'):")
print(df)
print("\n")
print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
# Check for conversion failures
print("Rows with NaN values:")
print(df[df.isna().any(axis=1)])
Output:
Original data:
sensor_id temperature pressure humidity
0 1 72.5 1013 45
1 2 68.3 1015 52
2 3 ERROR 1012 48
3 4 75.1 null 50
4 5 70.0 1014 N/A
==================================================
After conversion (errors='coerce'):
sensor_id temperature pressure humidity
0 1 72.5 1013.0 45.0
1 2 68.3 1015.0 52.0
2 3 NaN 1012.0 48.0
3 4 75.1 NaN 50.0
4 5 70.0 1014.0 NaN
Data types:
sensor_id int64
temperature float64
pressure float64
humidity float64
dtype: object
==================================================
Rows with NaN values:
sensor_id temperature pressure humidity
2 3 NaN 1012.0 48.0
3 4 75.1 NaN 50.0
4 5 70.0 1014.0 NaN
The errors='coerce' parameter converts invalid values to NaN instead of raising an exception. This allows the pipeline to continue while flagging problematic data for review.
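The default mode, errors='raise', is the opposite contract: stop the pipeline the moment a bad value appears. Which mode to use depends on whether downstream consumers prefer gaps or hard failures. A sketch contrasting the two:

```python
import pandas as pd

readings = pd.Series(["72.5", "ERROR", "70.0"])

# Default behaviour: stop on the first bad value
try:
    pd.to_numeric(readings)  # errors='raise' is the default
except ValueError as exc:
    print(f"raised: {exc}")

# Coerce mode: keep going, but count what was lost
coerced = pd.to_numeric(readings, errors="coerce")
print(f"invalid values: {coerced.isna().sum()}")
```

A common middle ground is to coerce, count the resulting NaNs, and fail the job only if the invalid fraction exceeds a threshold.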
Trim Whitespace
Whitespace in string data causes comparison failures and lookup errors. A user ID with trailing spaces will not match the same ID without spaces, breaking joins and filters. Always strip whitespace from string columns.
import pandas as pd
# Create data with whitespace issues
data = {
    'username': [' john_doe', 'jane_smith ', ' bob_jones ', 'alice_wong'],
    'email': ['john@email.com ', ' jane@email.com',
              'bob@email.com', ' alice@email.com '],
    'city': ['New York ', ' Los Angeles', ' Chicago ', 'Houston']
}
df = pd.DataFrame(data)
print("Before trimming:")
print(df)
print("\n")
print("String lengths:")
print(df.applymap(len))
print("\n" + "="*50 + "\n")
# Trim whitespace from all string columns
df['username'] = df['username'].str.strip()
df['email'] = df['email'].str.strip()
df['city'] = df['city'].str.strip()
print("After trimming:")
print(df)
print("\n")
print("String lengths:")
print(df.applymap(len))
print("\n" + "="*50 + "\n")
# Demonstrate the impact on matching
lookup_username = 'john_doe'
print(f"Looking for username: '{lookup_username}'")
print("Before trim - found:", ' john_doe' == lookup_username)
print("After trim - found:", 'john_doe' == lookup_username)
Output:
Before trimming:
username email city
0 john_doe john@email.com New York
1 jane_smith jane@email.com Los Angeles
2 bob_jones bob@email.com Chicago
3 alice_wong alice@email.com Houston
String lengths:
username email city
0 10 16 11
1 12 16 13
2 12 14 10
3 10 18 7
==================================================
After trimming:
username email city
0 john_doe john@email.com New York
1 jane_smith jane@email.com Los Angeles
2 bob_jones bob@email.com Chicago
3 alice_wong alice@email.com Houston
String lengths:
username email city
0 8 14 8
1 10 14 11
2 9 13 7
3 10 16 7
==================================================
Looking for username: 'john_doe'
Before trim - found: False
After trim - found: True
The str.strip() method removes leading and trailing whitespace. For complex cases, use str.replace() with regex patterns to remove internal whitespace or specific characters.
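For internal whitespace, which str.strip() cannot reach, a regex replacement collapses any run of spaces or tabs to a single space. A small sketch on hypothetical values:

```python
import pandas as pd

names = pd.Series(["  John   Doe ", "Jane\tSmith", "Bob  Jones"])

# Collapse any run of whitespace to one space, then trim the ends
cleaned = names.str.replace(r"\s+", " ", regex=True).str.strip()
print(cleaned.tolist())
```

The \s class also matches tabs and newlines, so this one pattern handles most whitespace irregularities at once.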
Remove Non-Numeric Characters
Numeric columns imported from external sources often contain formatting characters like currency symbols, commas, or percentage signs. These must be stripped before type conversion.
import pandas as pd
# Create data with non-numeric characters
data = {
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'price': ['$1,250.00', '$899.99', '$3,450.50', '$725.00'],
    'discount': ['15%', '20%', '10%', '25%'],
    'quantity': ['1,000', '2,500', '750', '1,250']
}
df = pd.DataFrame(data)
print("Before cleaning:")
print(df)
print("\n")
print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
# Remove currency symbols and commas from price
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].astype('float64')
# Remove percentage signs and convert
df['discount'] = df['discount'].str.replace('%', '', regex=False)
df['discount'] = df['discount'].astype('float64')
# Remove commas from quantity
df['quantity'] = df['quantity'].str.replace(',', '', regex=False)
df['quantity'] = df['quantity'].astype('int64')
print("After cleaning:")
print(df)
print("\n")
print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
# Perform calculations with cleaned data
df['discount_amount'] = df['price'] * (df['discount'] / 100)
df['final_price'] = df['price'] - df['discount_amount']
print("Calculations with cleaned data:")
print(df[['product', 'price', 'discount', 'final_price']])
Output:
Before cleaning:
product price discount quantity
0 Widget A $1,250.00 15% 1,000
1 Widget B $899.99 20% 2,500
2 Widget C $3,450.50 10% 750
3 Widget D $725.00 25% 1,250
Data types:
product object
price object
discount object
quantity object
dtype: object
==================================================
After cleaning:
product price discount quantity
0 Widget A 1250.00 15.0 1000
1 Widget B 899.99 20.0 2500
2 Widget C 3450.50 10.0 750
3 Widget D 725.00 25.0 1250
Data types:
product object
price float64
discount float64
quantity int64
dtype: object
==================================================
Calculations with cleaned data:
product price discount final_price
0 Widget A 1250.00 15.0 1062.50
1 Widget B 899.99 20.0 719.99
2 Widget C 3450.50 10.0 3105.45
3 Widget D 725.00 25.0 543.75
Using str.replace() with regex=False treats the pattern as a literal string, which is faster and safer for simple character removal. Chain multiple replacements for complex cleaning.
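When the characters to strip are known in advance, the chained literal replacements can also be collapsed into a single character-class regex. A sketch of the one-pass equivalent for the price column:

```python
import pandas as pd

prices = pd.Series(["$1,250.00", "$899.99", "$3,450.50"])

# One pass: drop both '$' and ',' via a character class, then convert
numeric_prices = prices.str.replace(r"[$,]", "", regex=True).astype("float64")
print(numeric_prices.tolist())
```

Inside a character class, '$' is a literal, so no escaping is needed here; for anything more elaborate than a fixed set of characters, prefer the explicit chained form for readability.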
Replace Values
Sometimes you need to standardize values or correct known errors in your data. The replace method handles single values, dictionaries of mappings, or regex patterns.
import pandas as pd
# Create data with inconsistent values
data = {
    'customer_id': [1, 2, 3, 4, 5],
    'status': ['Active', 'active', 'ACTIVE', 'Inactive', 'pending'],
    'region': ['North', 'South', 'N/A', 'West', 'unknown'],
    'rating': [5, 4, -1, 3, -1]
}
df = pd.DataFrame(data)
print("Before replacement:")
print(df)
print("\n" + "="*50 + "\n")
# Standardize status values
status_mapping = {'active': 'Active', 'ACTIVE': 'Active'}
df['status'] = df['status'].replace(status_mapping)
# Replace invalid region values
df['region'] = df['region'].replace({'N/A': None, 'unknown': None})
# Replace invalid ratings (-1) with NaN
df['rating'] = df['rating'].replace(-1, pd.NA)
print("After replacement:")
print(df)
print("\n" + "="*50 + "\n")
# Multiple replacements at once
data2 = {
    'product': ['iPhone', 'iphone', 'IPHONE', 'Galaxy', 'galaxy'],
    'price': [999, 999, 999, 899, 899]
}
df2 = pd.DataFrame(data2)
print("Multiple replacements:")
print("Before:")
print(df2)
# Replace all variations at once
df2['product'] = df2['product'].replace({'iphone': 'iPhone', 'IPHONE': 'iPhone',
                                         'galaxy': 'Galaxy'})
print("\nAfter:")
print(df2)
Output:
Before replacement:
customer_id status region rating
0 1 Active North 5
1 2 active South 4
2 3 ACTIVE N/A -1
3 4 Inactive West 3
4 5 pending unknown -1
==================================================
After replacement:
customer_id status region rating
0 1 Active North 5
1 2 Active South 4
2 3 Active None <NA>
3 4 Inactive West 3
4 5 pending None <NA>
==================================================
Multiple replacements:
Before:
product price
0 iPhone 999
1 iphone 999
2 IPHONE 999
3 Galaxy 899
4 galaxy 899
After:
product price
0 iPhone 999
1 iPhone 999
2 iPhone 999
3 Galaxy 899
4 Galaxy 899
The replace method accepts single values, lists, or dictionaries. For column-specific replacements, pass a nested dictionary with column names as keys.
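The column-specific form scopes each mapping to one column, so a value like 'unknown' in region is not accidentally rewritten in some other column where it is legitimate. A sketch of the nested-dictionary form on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["active", "ACTIVE", "pending"],
    "region": ["North", "unknown", "South"],
})

# Nested dict: outer keys are column names, inner dicts map old -> new
df = df.replace({
    "status": {"active": "Active", "ACTIVE": "Active"},
    "region": {"unknown": None},
})
print(df)
```

Only the named columns are touched; a 'pending' in status survives because no mapping was given for it.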
Clip Values to Range
Outliers and data entry errors can create impossible values. Sensor readings might report temperatures of 500 degrees or negative ages. Clipping constrains values to valid ranges without losing records.
import pandas as pd
import numpy as np
# Create data with outliers
data = {
    'sensor_id': range(1, 11),
    'temperature': [72, 68, 350, 75, 70, -50, 73, 71, 69, 74],
    'humidity': [45, 52, 48, 150, 50, -10, 55, 51, 49, 53],
    'pressure': [1013, 1015, 1012, 1018, 2500, 1014, 1016, 1011, 900, 1017]
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n")
print("Statistics before clipping:")
print(df.describe())
print("\n" + "="*50 + "\n")
# Clip temperature to reasonable range (0-120 Fahrenheit)
df['temperature'] = df['temperature'].clip(lower=0, upper=120)
# Clip humidity to valid percentage (0-100)
df['humidity'] = df['humidity'].clip(lower=0, upper=100)
# Clip pressure to typical range (950-1050 hPa)
df['pressure'] = df['pressure'].clip(lower=950, upper=1050)
print("After clipping:")
print(df)
print("\n")
print("Statistics after clipping:")
print(df.describe())
print("\n" + "="*50 + "\n")
# Identify clipped values
original_temp = [72, 68, 350, 75, 70, -50, 73, 71, 69, 74]
clipped_mask = df['temperature'] != original_temp
print(f"Rows with clipped temperature values:")
print(df[clipped_mask])
Output:
Original data:
sensor_id temperature humidity pressure
0 1 72 45 1013
1 2 68 52 1015
2 3 350 48 1012
3 4 75 150 1018
4 5 70 -10 2500
5 6 -50 55 1014
6 7 73 51 1016
7 8 71 49 1011
8 9 69 53 900
9 10 74 50 1017
Statistics before clipping:
sensor_id temperature humidity pressure
count 10.000000 10.000000 10.000000 10.000000
mean 5.500000 87.200000 49.100000 1221.600000
std 3.027650 116.620887 41.878122 481.752897
min 1.000000 -50.000000 -10.000000 900.000000
25% 3.250000 69.250000 48.250000 1012.250000
50% 5.500000 71.500000 50.500000 1015.500000
75% 7.750000 73.500000 52.750000 1017.250000
max 10.000000 350.000000 150.000000 2500.000000
==================================================
After clipping:
sensor_id temperature humidity pressure
0 1 72 45 1013
1 2 68 52 1015
2 3 120 48 1012
3 4 75 100 1018
4 5 70 0 1050
5 6 0 55 1014
6 7 73 51 1016
7 8 71 49 1011
8 9 69 53 950
9 10 74 50 1017
Statistics after clipping:
sensor_id temperature humidity pressure
count 10.000000 10.000000 10.000000 10.000000
mean 5.500000 69.200000 50.300000 1015.500000
std 3.027650 30.945441 26.063015 27.726341
min 1.000000 0.000000 0.000000 950.000000
25% 3.250000 69.250000 48.250000 1012.250000
50% 5.500000 71.500000 50.500000 1015.500000
75% 7.750000 73.500000 54.000000 1017.250000
max 10.000000 120.000000 100.000000 1050.000000
==================================================
Rows with clipped temperature values:
sensor_id temperature humidity pressure
2 3 120 48 1012
5 6 0 55 1014
The clip method takes lower and upper bounds. Values below the lower bound are set to the lower bound, and values above the upper bound are set to the upper bound. This preserves data volume while enforcing constraints.
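The example above recovers the clipped rows by comparing against a hand-copied list, which is fragile. In a pipeline it is more robust to compute the out-of-range mask before mutating the column; a sketch:

```python
import pandas as pd

temps = pd.Series([72, 350, -50, 71])
LOW, HIGH = 0, 120  # assumed valid range for this sketch

# Record which values are out of range *before* clipping mutates them
clipped_mask = (temps < LOW) | (temps > HIGH)
temps = temps.clip(lower=LOW, upper=HIGH)

print(f"clipped {clipped_mask.sum()} values")
print(temps.tolist())
```

The mask can then feed a data-quality score or an alerting threshold, preserving the information that clipping would otherwise erase.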
Round Values
Excessive precision in floating-point numbers creates false accuracy and makes data harder to read. Round values to appropriate precision based on measurement accuracy and business requirements.
import pandas as pd
import numpy as np
# Create data with excessive precision
data = {
    'transaction_id': range(1, 6),
    'amount': [123.456789, 456.789012, 789.012345, 234.567890, 567.890123],
    'tax_rate': [0.0825, 0.0825, 0.0825, 0.0825, 0.0825],
    'distance_km': [12.3456789, 45.6789012, 78.9012345, 23.4567890, 56.7890123]
}
df = pd.DataFrame(data)
print("Before rounding:")
print(df)
print("\n" + "="*50 + "\n")
# Round amount to 2 decimal places (standard for currency)
df['amount'] = df['amount'].round(2)
# Calculate tax with appropriate precision
df['tax'] = (df['amount'] * df['tax_rate']).round(2)
# Round distance to 1 decimal place
df['distance_km'] = df['distance_km'].round(1)
print("After rounding:")
print(df)
print("\n" + "="*50 + "\n")
# Demonstrate different rounding methods
sample_values = pd.Series([1.234, 2.567, 3.891, 4.445, 5.555])
print("Original values:")
print(sample_values)
print("\nRound to 0 decimals:")
print(sample_values.round(0))
print("\nRound to 1 decimal:")
print(sample_values.round(1))
print("\nRound to 2 decimals:")
print(sample_values.round(2))
Output:
Before rounding:
transaction_id amount tax_rate distance_km
0 1 123.456789 0.0825 12.3456789
1 2 456.789012 0.0825 45.6789012
2 3 789.012345 0.0825 78.9012345
3 4 234.567890 0.0825 23.4567890
4 5 567.890123 0.0825 56.7890123
==================================================
After rounding:
transaction_id amount tax_rate distance_km tax
0 1 123.46 0.0825 12.3 10.19
1 2 456.79 0.0825 45.7 37.69
2 3 789.01 0.0825 78.9 65.09
3 4 234.57 0.0825 23.5 19.35
4 5 567.89 0.0825 56.8 46.85
==================================================
Original values:
0 1.234
1 2.567
2 3.891
3 4.445
4 5.555
dtype: float64
Round to 0 decimals:
0 1.0
1 3.0
2 4.0
3 4.0
4 6.0
dtype: float64
Round to 1 decimal:
0 1.2
1 2.6
2 3.9
3 4.4
4 5.6
dtype: float64
Round to 2 decimals:
0 1.23
1 2.57
2 3.89
3 4.45
4 5.56
dtype: float64
The round method uses banker's rounding (round half to even) by default. For currency calculations, always round the final result to avoid compounding rounding errors.
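Binary floats add a second wrinkle: a value like 2.675 cannot be stored exactly, so float rounding can land on the "wrong" side of a half. When exact, auditable half-up behaviour is required for money, Python's decimal module is the safer tool; a sketch:

```python
from decimal import Decimal, ROUND_HALF_UP

# 2.675 is stored as a float slightly below 2.675, so round() gives 2.67
print(round(2.675, 2))

# Decimal arithmetic with explicit half-up rounding avoids the surprise
amount = Decimal("2.675")
rounded = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(rounded)
```

For bulk numeric work, float rounding in pandas is usually fine; reach for Decimal when individual cents must reconcile with an accounting system.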
Normalize Column Names
Inconsistent column names cause code fragility. Mixed case, spaces, and special characters make column references error-prone. Standardize column names to lowercase with underscores for reliable programmatic access.
import pandas as pd
# Create data with problematic column names
data = {
    'Customer ID': [1, 2, 3],
    'First Name': ['John', 'Jane', 'Bob'],
    'Last Name': ['Doe', 'Smith', 'Jones'],
    'Email Address': ['john@email.com', 'jane@email.com', 'bob@email.com'],
    'Phone Number': ['555-1234', '555-5678', '555-9012'],
    'Purchase Date': ['2024-01-15', '2024-01-16', '2024-01-17']
}
df = pd.DataFrame(data)
print("Before normalization:")
print("Column names:", df.columns.tolist())
print(df)
print("\n" + "="*50 + "\n")
# Normalize column names: lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')
print("After normalization:")
print("Column names:", df.columns.tolist())
print(df)
print("\n" + "="*50 + "\n")
# Demonstrate easier column access
print("Accessing normalized columns:")
print(f"Customer ID column: {df['customer_id'].tolist()}")
print(f"Email addresses: {df['email_address'].tolist()}")
# More complex normalization for special characters
data2 = {
    'Product#ID': [101, 102, 103],
    'Price($)': [29.99, 45.50, 19.99],
    'Discount%': [10, 15, 20]
}
df2 = pd.DataFrame(data2)
print("\n\nBefore complex normalization:")
print("Column names:", df2.columns.tolist())
print(df2)
# Remove special characters and normalize
df2.columns = df2.columns.str.lower().str.replace('[^a-z0-9_]', '_', regex=True)
df2.columns = df2.columns.str.replace('_+', '_', regex=True).str.strip('_')
print("\nAfter complex normalization:")
print("Column names:", df2.columns.tolist())
print(df2)
Output:
Before normalization:
Column names: ['Customer ID', 'First Name', 'Last Name', 'Email Address', 'Phone Number', 'Purchase Date']
Customer ID First Name Last Name Email Address Phone Number Purchase Date
0 1 John Doe john@email.com 555-1234 2024-01-15
1 2 Jane Smith jane@email.com 555-5678 2024-01-16
2 3 Bob Jones bob@email.com 555-9012 2024-01-17
==================================================
After normalization:
Column names: ['customer_id', 'first_name', 'last_name', 'email_address', 'phone_number', 'purchase_date']
customer_id first_name last_name email_address phone_number purchase_date
0 1 John Doe john@email.com 555-1234 2024-01-15
1 2 Jane Smith jane@email.com 555-5678 2024-01-16
2 3 Bob Jones bob@email.com 555-9012 2024-01-17
==================================================
Accessing normalized columns:
Customer ID column: [1, 2, 3]
Email addresses: ['john@email.com', 'jane@email.com', 'bob@email.com']
Before complex normalization:
Column names: ['Product#ID', 'Price($)', 'Discount%']
Product#ID Price($) Discount%
0 101 29.99 10
1 102 45.50 15
2 103 19.99 20
After complex normalization:
Column names: ['product_id', 'price', 'discount']
product_id price discount
0 101 29.99 10
1 102 45.50 15
2 103 19.99 20
Normalized column names eliminate the need for bracket notation and make code more readable. Establish naming conventions early and enforce them consistently across all datasets.
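One way to enforce that convention is to put the normalization steps shown above into a single helper applied at every load point; a sketch (the function name is illustrative):

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase, replace non-alphanumerics with '_', collapse repeats."""
    out = df.copy()
    out.columns = (
        out.columns.str.lower()
        .str.replace(r"[^a-z0-9_]", "_", regex=True)
        .str.replace(r"_+", "_", regex=True)
        .str.strip("_")
    )
    return out

df = pd.DataFrame({"Customer ID": [1], "Price($)": [9.99]})
print(normalize_columns(df).columns.tolist())
```

Centralizing the rule means every dataset entering the pipeline gets identical column names, so downstream code can hard-code them safely.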
End-to-End Data Validation and Cleaning Pipeline
This complete example demonstrates a production-grade data validation pipeline that implements data contracts, catches quality issues early, and produces clean, reliable data for downstream systems.
import pandas as pd
import numpy as np
from datetime import datetime
def validate_and_clean_transaction_data(file_path=None, data=None):
    """
    Complete data validation and cleaning pipeline with quality checks.
    Returns cleaned dataframe and validation report.
    """
    # Load data (using sample data for demonstration)
    if data is not None:
        df = pd.DataFrame(data)
    else:
        # In production, load from file
        df = pd.read_csv(file_path)

    validation_report = {
        'original_rows': len(df),
        'issues_found': [],
        'rows_modified': 0,
        'rows_removed': 0
    }

    print("="*60)
    print("DATA VALIDATION AND CLEANING PIPELINE")
    print("="*60)
    print(f"\nOriginal dataset: {df.shape[0]} rows, {df.shape[1]} columns")
    print("\nFirst 5 rows of raw data:")
    print(df.head())
    print("\n" + "-"*60 + "\n")

    # Step 1: Check for duplicates
    duplicate_count = df.duplicated().sum()
    if duplicate_count > 0:
        validation_report['issues_found'].append(
            f"Found {duplicate_count} duplicate rows"
        )
        df = df.drop_duplicates()
        validation_report['rows_removed'] += duplicate_count
        print(f"✓ Removed {duplicate_count} duplicate rows")

    # Step 2: Validate and normalize column names
    df.columns = (df.columns.str.lower()
                  .str.replace(' ', '_')
                  .str.replace('[^a-z0-9_]', '_', regex=True))
    print("✓ Normalized column names")

    # Step 3: Check and convert data types
    print("\nData type validation:")

    # Transaction ID should be integer
    if df['transaction_id'].dtype != 'int64':
        df['transaction_id'] = pd.to_numeric(df['transaction_id'], errors='coerce')
        invalid_ids = df['transaction_id'].isna().sum()
        if invalid_ids > 0:
            validation_report['issues_found'].append(
                f"Found {invalid_ids} invalid transaction IDs"
            )
        print(f"  ⚠ Converted transaction_id, {invalid_ids} invalid values set to NaN")

    # Amount should be numeric
    if df['amount'].dtype == 'object':
        # Remove currency symbols and commas
        df['amount'] = df['amount'].str.replace('$', '', regex=False)
        df['amount'] = df['amount'].str.replace(',', '', regex=False)
        df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
        print("  ✓ Cleaned and converted amount to numeric")

    # Quantity should be integer
    if df['quantity'].dtype == 'object':
        df['quantity'] = df['quantity'].str.replace(',', '', regex=False)
        df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce').astype('Int64')
        print("  ✓ Converted quantity to integer")

    # Step 4: Handle missing values
    missing_summary = df.isna().sum()
    missing_columns = missing_summary[missing_summary > 0]
    if len(missing_columns) > 0:
        print("\nMissing value handling:")
        for col, count in missing_columns.items():
            validation_report['issues_found'].append(
                f"Column '{col}' has {count} missing values"
            )
            print(f"  ⚠ {col}: {count} missing values")

        # Remove rows with critical missing values
        critical_columns = ['transaction_id', 'amount']
        rows_before = len(df)
        df = df.dropna(subset=critical_columns)
        rows_dropped = rows_before - len(df)
        if rows_dropped > 0:
            validation_report['rows_removed'] += rows_dropped
            print(f"  ✓ Removed {rows_dropped} rows with missing critical data")

    # Step 5: Trim whitespace from string columns
    string_columns = df.select_dtypes(include=['object']).columns
    for col in string_columns:
        df[col] = df[col].str.strip()
    print(f"\n✓ Trimmed whitespace from {len(string_columns)} string columns")

    # Step 6: Standardize categorical values
    if 'status' in df.columns:
        status_mapping = {
            'active': 'Active',
            'ACTIVE': 'Active',
            'inactive': 'Inactive',
            'INACTIVE': 'Inactive',
            'pending': 'Pending',
            'PENDING': 'Pending'
        }
        df['status'] = df['status'].replace(status_mapping)
        print("✓ Standardized status values")

    # Step 7: Validate and clip numeric ranges
    print("\nNumeric validation:")

    # Amount should be positive
    negative_amounts = (df['amount'] < 0).sum()
    if negative_amounts > 0:
        validation_report['issues_found'].append(
            f"Found {negative_amounts} negative amounts"
        )
        df['amount'] = df['amount'].clip(lower=0)
        validation_report['rows_modified'] += negative_amounts
        print(f"  ✓ Clipped {negative_amounts} negative amounts to 0")

    # Quantity should be positive and reasonable
    invalid_quantity = ((df['quantity'] < 0) | (df['quantity'] > 10000)).sum()
    if invalid_quantity > 0:
        validation_report['issues_found'].append(
            f"Found {invalid_quantity} invalid quantities"
        )
        df['quantity'] = df['quantity'].clip(lower=0, upper=10000)
        validation_report['rows_modified'] += invalid_quantity
        print(f"  ✓ Clipped {invalid_quantity} quantities to valid range (0-10000)")

    # Step 8: Round numeric values to appropriate precision
    df['amount'] = df['amount'].round(2)
    print("✓ Rounded amounts to 2 decimal places")

    # Step 9: Add data quality flags
    df['data_quality_score'] = 100
    # Penalize for imputed values
    if 'quantity' in df.columns:
        df.loc[df['quantity'].isna(), 'data_quality_score'] -= 10
    # Penalize for clipped values (would need to track before clipping in production)

    print("\n" + "-"*60)
    print("\nCLEANED DATA SUMMARY:")
    print(f"Final dataset: {df.shape[0]} rows, {df.shape[1]} columns")
    print("\nData types:")
    print(df.dtypes)
    print("\n\nValidation Report:")
    print(f"  Original rows: {validation_report['original_rows']}")
    print(f"  Final rows: {len(df)}")
    print(f"  Rows removed: {validation_report['rows_removed']}")
    print(f"  Rows modified: {validation_report['rows_modified']}")
    print(f"  Issues found: {len(validation_report['issues_found'])}")
    if validation_report['issues_found']:
        print("\n  Detailed issues:")
        for issue in validation_report['issues_found']:
            print(f"    - {issue}")

    print("\n" + "="*60)
    print("First 5 rows of cleaned data:")
    print(df.head())
    print("\n" + "="*60)

    return df, validation_report
# Sample transaction data with various quality issues
sample_data = {
    'Transaction ID': [1001, 1002, 1003, 1001, 1004, 1005, 1006, 1007],
    'Amount': ['$1,250.456', '$-50.00', '$3,450.50', '$1,250.456',
               '$999.99', 'ERROR', '$450.00', '$2,100.00'],
    'Quantity': ['10', '5', '15000', '10', '25', '30', '8', '12'],
    'Status': ['Active', 'active', 'PENDING', 'Active',
               'Inactive', 'pending', 'Active', 'ACTIVE'],
    'Customer Name': [' John Doe ', 'Jane Smith', 'Bob Jones ',
                      ' John Doe ', 'Alice Wong', 'Charlie Brown',
                      'David Lee', 'Emma Wilson'],
    'Region': ['North', ' South', 'East ', 'North',
               'West', 'North', 'South', 'East']
}
# Run the validation pipeline
cleaned_df, report = validate_and_clean_transaction_data(data=sample_data)
# Demonstrate data quality improvements
print("\n\nDATA QUALITY COMPARISON:")
print("-"*60)
# Create original df for comparison
original_df = pd.DataFrame(sample_data)
print(f"\nOriginal data - unique transactions: {original_df['Transaction ID'].nunique()}")
print(f"Cleaned data - unique transactions: {cleaned_df['transaction_id'].nunique()}")
print(f"\nOriginal data - Status values: {original_df['Status'].unique()}")
print(f"Cleaned data - Status values: {cleaned_df['status'].unique()}")
print("\nData is now ready for downstream processing:")
print(" ✓ No duplicates")
print(" ✓ Consistent data types")
print(" ✓ Normalized column names")
print(" ✓ Validated ranges")
print(" ✓ Standardized categorical values")
print(" ✓ Whitespace removed")
print(" ✓ Quality scores assigned")
Output:
============================================================
DATA VALIDATION AND CLEANING PIPELINE
============================================================
Original dataset: 8 rows, 6 columns
First 5 rows of raw data:
Transaction ID Amount Quantity Status Customer Name Region
0 1001 $1,250.456 10 Active John Doe North
1 1002 $-50.00 5 active Jane Smith South
2 1003 $3,450.50 15000 PENDING Bob Jones East
3 1001 $1,250.456 10 Active John Doe North
4 1004 $999.99 25 Inactive Alice Wong West
------------------------------------------------------------
✓ Removed 1 duplicate rows
✓ Normalized column names
Data type validation:
✓ Cleaned and converted amount to numeric
✓ Converted quantity to integer
Missing value handling:
⚠ amount: 1 missing values
✓ Removed 1 rows with missing critical data
✓ Trimmed whitespace from 3 string columns
✓ Standardized status values
Numeric validation:
✓ Clipped 1 negative amounts to 0
✓ Clipped 1 quantities to valid range (0-10000)
✓ Rounded amounts to 2 decimal places
------------------------------------------------------------
CLEANED DATA SUMMARY:
Final dataset: 6 rows, 7 columns
Data types:
transaction_id float64
amount float64
quantity Int64
status object
customer_name object
region object
data_quality_score int64
dtype: object
Validation Report:
Original rows: 8
Final rows: 6
Rows removed: 2
Rows modified: 2
Issues found: 5
Detailed issues:
- Found 1 duplicate rows
- Found 1 invalid transaction IDs
- Column 'amount' has 1 missing values
- Found 1 negative amounts
- Found 1 invalid quantities
============================================================
First 5 rows of cleaned data:
transaction_id amount quantity status customer_name region data_quality_score
0 1001.0 1250.46 10 Active John Doe North 100
1 1002.0 0.00 5 Active Jane Smith South 100
2 1003.0 3450.50 10000 Pending Bob Jones East 100
3 1004.0 999.99 25 Inactive Alice Wong West 100
4 1005.0 450.00 30 Pending Charlie Brown North 100
============================================================
DATA QUALITY COMPARISON:
------------------------------------------------------------
Original data - unique transactions: 6
Cleaned data - unique transactions: 5
Original data - Status values: ['Active' 'active' 'PENDING' 'Inactive' 'pending' 'ACTIVE']
Cleaned data - Status values: ['Active' 'Pending' 'Inactive']
Data is now ready for downstream processing:
✓ No duplicates
✓ Consistent data types
✓ Normalized column names
✓ Validated ranges
✓ Standardized categorical values
✓ Whitespace removed
✓ Quality scores assigned
Final Thoughts
Data validation and cleaning are not preprocessing tasks that you run once and forget. They are continuous processes that protect your entire data pipeline. The techniques covered here form a data contract between your sources and your systems. They catch errors before they propagate, standardize inconsistencies before they break joins, and flag anomalies before they corrupt analytics.
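As a minimal sketch of such a contract, consider a schema check that reports violations loudly before data enters the pipeline. The column names, dtypes, and allowed status values below are illustrative assumptions, not taken from the pipeline above:

```python
import pandas as pd

# Hypothetical contract: expected columns, dtypes, and allowed categorical values.
CONTRACT = {"transaction_id": "int64", "amount": "float64", "status": "object"}
ALLOWED_STATUS = {"Active", "Pending", "Inactive"}

def enforce_contract(df: pd.DataFrame) -> list:
    """Return a list of contract violations instead of failing silently."""
    violations = []
    for col, dtype in CONTRACT.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            violations.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if "status" in df.columns:
        unexpected = set(df["status"].dropna()) - ALLOWED_STATUS
        if unexpected:
            violations.append(f"status: unexpected values {sorted(unexpected)}")
    return violations

incoming = pd.DataFrame({
    "transaction_id": [1001, 1002],
    "amount": [250.0, 150.0],
    "status": ["Active", "failed"],
})
problems = enforce_contract(incoming)
for p in problems:
    print(f"CONTRACT VIOLATION: {p}")
```

Returning a list of violations rather than raising on the first one lets a report surface every breach at once, which is far more useful when debugging an upstream source.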
Every data quality issue you catch in validation is a production failure you prevent. Every type conversion you enforce is a runtime error you avoid. Every range check you implement is a model degradation you forestall. The minutes you invest in building robust validation pipelines save hours of debugging and days of recovery from silent data degradation.
Start with the fundamentals: remove duplicates, validate types, trim whitespace, and clip outliers. Build from there with domain-specific rules: valid ranges for your metrics, allowed values for your categories, and referential integrity for your relationships. Most importantly, make validation failures visible. Log them, report them, and act on them before they compound.
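Those fundamentals chain together in a few lines of pandas. The column names and clipping bounds here are illustrative assumptions chosen for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "amount": ["10.5", "999999", "999999", "-4"],
    "region": [" North", "South ", "South ", "East"],
})

cleaned = (
    df.drop_duplicates()  # remove exact duplicate rows
      .assign(
          # validate types: coerce non-numeric strings to NaN, then clip outliers
          amount=lambda d: pd.to_numeric(d["amount"], errors="coerce")
                             .clip(lower=0, upper=10_000),
          # trim stray whitespace from categorical text
          region=lambda d: d["region"].str.strip(),
      )
)
print(cleaned)
```

Each step is idempotent, so the chain can be rerun safely on already-clean data, which is exactly what you want from a validation stage that sits in a scheduled pipeline.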
Your data quality is only as good as your validation pipeline. Build that pipeline with the same rigor you apply to production code, and your downstream systems will thank you.
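One way to apply that rigor is to route validation findings through the standard `logging` machinery rather than bare `print` calls, so they land wherever production logs already go. This is a sketch; the function and column names are illustrative:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
logger = logging.getLogger("data_validation")

def check_range(df: pd.DataFrame, col: str, low: float, high: float) -> pd.DataFrame:
    """Clip out-of-range values, but log how many were touched first."""
    out_of_range = int((~df[col].between(low, high)).sum())
    if out_of_range:
        logger.warning("%s: %d values outside [%s, %s], clipping",
                       col, out_of_range, low, high)
    return df.assign(**{col: df[col].clip(low, high)})

df = pd.DataFrame({"amount": [-5.0, 120.0, 50.0]})
df = check_range(df, "amount", 0, 100)
```

Because the warning fires before the data is mutated, the log preserves a record of what the raw input looked like even after clipping has hidden the evidence.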
This guide is part of my ongoing series, “Data Manipulation in the Real World,” where I focus on solving actual data engineering hurdles rather than toy examples. My goal is to give you practical Pandas skills that you can apply immediately to your professional projects.
If you found this article helpful, please consider leaving a clap, sharing it with your network, or commenting with your own data validation experiences. Follow for more practical guides on data manipulation and engineering.
Your engagement helps these articles reach others who can benefit from them. What data quality challenges have you faced in production? What validation strategies work best in your domain? Share your experiences in the comments below.
Part 16: Data Manipulation in Data Validation and Quality Control was originally published in Towards AI on Medium.