
Seven Easy Data Validation Checks

Ehsan Khodabandeh
3 min read · Jun 5, 2018


If you work with data, you know how crucial it is to validate it before modeling. As critical as it is, data validation is often overlooked during a project, perhaps because it's perceived as less interesting than other aspects of the modeling process.

As I shared in a previous post, I believe that checklists are useful tools in overcoming failures and reducing errors, especially in routine tasks. This time, I want to share one of the checklists I often use as soon as I receive a new dataset. Keep in mind: I don’t intend to offer an overly reductive account of data cleaning and validation. This is a checklist of the simplest and most important validation checks that one can perform easily to quickly find and fix some of the most obvious errors in the data.
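To keep things concrete, I'll illustrate each check with a short Python/pandas sketch. The file names, column names, and thresholds in these snippets are made up for illustration; adapt them to your own data.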

1. Column data types and underlying data

If you have a table, check the data types of the columns to ensure they are what they're supposed to be. This may not seem intuitive, but zipcodes, for example, should be strings instead of a numeric type. Many languages and systems strip leading zeros from numeric values, so you may mistakenly retrieve a New Jersey zipcode (such as 07885 of Wharton, NJ) as 7885, a postal code on the West Coast of New Zealand! So keep an eye out for the data types of zipcode fields in your data (Holtsville in NY and some places in Puerto Rico and the Virgin Islands have zipcodes with two leading zeros).
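Here's a minimal pandas sketch of this check (the file and column names are hypothetical):

```python
import pandas as pd

# Read zipcodes as text; letting pandas infer the type would
# load 07885 as the integer 7885.
df = pd.read_csv("customers.csv", dtype={"zipcode": str})

print(df.dtypes)
# Flag any value that isn't exactly five digits.
print(df[~df["zipcode"].str.fullmatch(r"\d{5}")])
```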

2. When in doubt, use string or text as the format — you can change it later if needed

Check the dates and times and ensure they are in the right format. When in doubt, make everything text instead of numeric. It’s easier to convert text to any data type you need.
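A sketch of this approach (the column name is assumed): load date values as text first, then convert them deliberately.

```python
import pandas as pd

# Load the raw values as text, then convert on purpose; rows that
# fail to parse become NaT instead of silently taking a wrong value.
df = pd.read_csv("shipments.csv", dtype={"ship_date": str})
df["ship_date_parsed"] = pd.to_datetime(df["ship_date"], errors="coerce")

# Inspect the rows that didn't parse cleanly.
print(df[df["ship_date_parsed"].isna()])
```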

3. For dates, times and duration values, check to make sure the values in your data are all positive

A negative duration means your end date/time is before your start date/time, which usually isn’t a good sign. Also, look at their maximum and minimum values. If you are dealing with data from last year, then logically you shouldn’t see dates belonging to next year or a decade ago.
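Both checks in a quick sketch, assuming hypothetical start_time and end_time columns:

```python
import pandas as pd

df = pd.read_csv("trips.csv", parse_dates=["start_time", "end_time"])

# End before start: a negative duration is usually an error.
duration = df["end_time"] - df["start_time"]
print(df[duration < pd.Timedelta(0)])

# Range check: last year's data shouldn't contain dates from
# next year or a decade ago.
print(df["start_time"].min(), df["start_time"].max())
```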

4. Determine if there are negatives or nulls where you don’t expect them

Do a quick check on the minimum and maximum of each column or feature, and check whether there are any “Null,” “None,” or missing values. These statistics give you a better overall picture of errors in the features. For example, if you expect to receive daily data on trucks leaving a warehouse, very small weights like 1 lb, or weights greater than the truck’s capacity, are questionable.
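In pandas, a few lines cover most of this (the capacity and column names are made up):

```python
import pandas as pd

df = pd.read_csv("truck_loads.csv")

# Min, max, and other summary statistics for every numeric column.
print(df.describe())

# Missing values per column; strings like "None" or "NULL" won't
# show up as missing, so check for them explicitly too.
print(df.isna().sum())
print(df.isin(["None", "NULL", "N/A"]).sum())

# Domain check: weights should be positive and below capacity.
capacity_lb = 45_000  # assumed capacity for illustration
print(df[(df["weight_lb"] <= 0) | (df["weight_lb"] > capacity_lb)])
```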

5. Check for standardization of data (five-digit or nine-digit zips, timezones, lat/lon formats, etc.)

Know what latitude and longitude are supposed to show, and use them to perform a sanity check. Latitude measures how far north (positive) or south (negative) of the equator a point is; longitude measures how far east (positive) or west (negative) of the prime meridian it is. So, if your data covers a country in North America and you see anything other than positive latitudes (it's in the Northern hemisphere) and negative longitudes (it's west of the prime meridian), your data is not correct.
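For example, here's a rough bounding-box check for data that should fall in the continental US (the box below is approximate, not an exact border):

```python
import pandas as pd

df = pd.read_csv("locations.csv")

# Continental US: latitude roughly 24 to 50 (north, positive),
# longitude roughly -125 to -66 (west, negative).
bad = df[~df["latitude"].between(24, 50) | ~df["longitude"].between(-125, -66)]
print(bad)
```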

6. Confirm the scope and accuracy of your data by checking key aggregate metrics like volumes, cost, and others

Use aggregate functions like sum, count, min, max, and other related operations to help validate data completeness and accuracy.
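For instance, with daily order data you might compare totals and per-day counts against what the data owner expects (the column names are assumed):

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Headline numbers to compare against the source system.
print("rows:", len(df))
print("total cost:", df["cost"].sum())

# Per-day aggregates; a day with zero records, or double the
# usual count, suggests missing or duplicated data.
print(df.groupby("order_date")["cost"].agg(["count", "sum", "min", "max"]))
```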

7. Excel can cause data issues; when available, use CSVs instead

Finally, if you can choose between an Excel file (.xls or .xlsx) and a CSV (.csv), pick the latter. Excel automatically tries to find the best type for the data it opens, which can often cause trouble. Also, be aware that opening a CSV in Excel to validate the data can alter it, incorrectly casting field types or even truncating cell values.
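Reading the CSV directly keeps you in control of every type; a sketch (file and column names assumed):

```python
import pandas as pd

df = pd.read_csv(
    "export.csv",
    dtype=str,              # take every column as text first...
    keep_default_na=False,  # ...and don't turn blanks into NaN yet
)
# ...then convert each column on purpose.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
```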

If you liked this blog post, let’s stay in touch on Medium, Twitter, and LinkedIn.
