An Exploratory Data Analysis Tool Comparison

Assisted Exploratory Data Analysis (EDA) is gaining popularity in the data science community. As the Kaggle CTO tweeted, data science is 90% understanding your data, yet most tools focus on automating the coding rather than the data analysis. This article breaks down two of the best tools: Pandas Profiling and Kortical - ML Data Prep.
Since I work for Kortical, I am very much team Kortical, but I have tried to make the comparison as fair as possible. I’ll start with a high-level comparison chart, and every point in it is backed up by the detailed usage comparison that follows the chart.
For timings, features and so on, I’ve based the comparison on the famous Kaggle dataset House Prices: Advanced Regression Techniques. I chose it because it is one of the starter datasets for Kaggle and, at 461KB, it is slightly larger. Most criticism levelled at Pandas Profiling concerns how useful it is for large datasets, yet 461KB is tiny compared to almost any real-life dataset we’ve seen.
For the chart below I’ve highlighted any obvious winner for a given feature in green.
Pandas Profiling creates a local webserver, so it is accessed in the browser, as is Kortical. It gives you a huge, and I mean HUGE, list of columns to scroll through. This means a lot of legwork to find out which columns need feature engineering, which are important and which actually make a difference to machine learning.
Interact with the image on the right to see how much scrolling is needed to see all the data columns on Pandas.

Pandas Profiling’s endless scrolling makes data analysis feel like a never-ending chore.
You could use its correlation charts to try to spot which variables are highly correlated with SalePrice. If you look closely you can see that OverallQual seems darker than the other columns, though you might struggle to rank the remaining correlated variables by visual matching alone.
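If you would rather not rank the heatmap by eye, the same information can be pulled out as a sorted list. The snippet below is a minimal sketch using plain pandas, not anything built into Pandas Profiling; the helper name `rank_by_correlation` and the toy values are my own and only stand in for the real House Prices data.

```python
import pandas as pd


def rank_by_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Order numeric columns by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Sort by magnitude but keep the sign so negative relationships stay visible.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy stand-in for the House Prices data (values invented for illustration).
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 4],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "YrSold": [2008, 2007, 2008, 2006, 2008, 2009],
    "SalePrice": [150000, 180000, 220000, 260000, 320000, 120000],
})

ranking = rank_by_correlation(toy, "SalePrice")
print(ranking)
```

Bear in mind that this only captures linear relationships between numeric columns and the target, which is exactly the limitation of reading a correlation chart.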

Kortical - ML Data Prep, on the other hand, has a feature importance chart and orders your variables by importance. This can find much more complex relationships than simple correlations. It also lets you know which columns can be safely dropped. There are many more insights it could have surfaced, but these are the ones that are relevant for this dataset.

So if you’re looking through the columns one by one in Pandas Profiling you would have to figure out that OverallQual is important from this information. It does have a bunch of useful statistics but it can be hard to imagine their relationship to the target from this data alone.

In Kortical the important columns are surfaced to you in order of importance, so you can just start at the top and work your way down, knowing that the columns you’re looking at are very impactful. Kortical also shows you the relationship between the column and the target, so you can understand why it’s predictive. Below you can easily see that SalePrice goes higher as the OverallQual score increases, showing a clear relationship.

You can also see the data distribution for the column.

Pandas Profiling has a number of charts, but by far the most useful is the first, called ‘Counts’. This shows which columns have a large proportion of missing values. It is not zoomable, so if you can’t make out the column name it becomes pretty useless, but you can still get a sense of the overall shape and volume of missing data.

Kortical has a very similar chart, but it is zoomable so you can see any level of detail. It’s also colour coded, as completeness is not just about missing values but also about whether a single value dominates too much. Green lets you know there is a good mix of values and pink tells you that a single value dominates the column. This can be OK, but if you see a pink column that you would expect to have a good mix of values, you should investigate. Kortical also offers a plain-text description of what’s going on, and the check boxes allow you to select columns to remove, but more on these later.
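To make the two notions of completeness concrete, here is a rough sketch in plain pandas that computes, per column, the fraction of missing values and the share taken by the most common value. This is my own illustration and not how Kortical computes its chart; the function name `completeness_summary` and the toy rows are assumptions, and I have deliberately not guessed at the thresholds that turn a column green or pink.

```python
import pandas as pd


def completeness_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per column: fraction of missing values and share of the most common value."""
    rows = []
    for col in df.columns:
        series = df[col]
        missing_fraction = series.isna().mean()
        # Shares are computed over non-missing values only.
        shares = series.value_counts(normalize=True, dropna=True)
        dominant_share = shares.iloc[0] if len(shares) else float("nan")
        rows.append({
            "column": col,
            "missing_fraction": missing_fraction,
            "dominant_share": dominant_share,
        })
    summary = pd.DataFrame(rows).set_index("column")
    return summary.sort_values("missing_fraction", ascending=False)


# Toy rows using House Prices column names (values invented for illustration).
toy = pd.DataFrame({
    "PoolQC": [None, None, None, "Gd"],
    "Street": ["Pave", "Pave", "Pave", "Pave"],
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "NoRidge"],
})

summary = completeness_summary(toy)
print(summary)
```

A column like `Street` here has no missing values at all, yet a single value accounts for every row, which is the case a missing-values chart on its own would not flag.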

We can zoom in on problem columns and check columns we want to remove.

Pandas Profiling doesn’t offer any target insights, but Kortical has a good deal of useful info, such as the problem type, possible evaluation metrics, ways to reframe the problem for better ML results and an indicative Mean Absolute Error for different value ranges of SalePrice. This shows us that ML can slightly overestimate values in the lowest range and substantially underestimate values in the top range, while most middle ranges tend to be pretty accurate. The model behind these figures is not supposed to be the final model; it is there to give an indication of performance.
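If you want to reproduce this kind of breakdown yourself, a simple approach is to bin the actual values and compute the error per bin. The sketch below is an assumption about one reasonable way to do it, not Kortical’s implementation; `mae_by_range`, the bin edges and the toy predictions are all made up for illustration. It reports the signed mean error alongside the MAE, because the sign is what tells you whether a range is over- or underestimated.

```python
import pandas as pd


def mae_by_range(actual, predicted, bins) -> pd.DataFrame:
    """Mean absolute error and mean signed error per range of the actual value."""
    df = pd.DataFrame({"actual": actual, "predicted": predicted})
    df["range"] = pd.cut(df["actual"], bins=bins)
    df["error"] = df["predicted"] - df["actual"]  # positive means overestimate
    df["abs_error"] = df["error"].abs()
    return df.groupby("range", observed=True).agg(
        mae=("abs_error", "mean"),
        mean_error=("error", "mean"),
        n=("actual", "size"),
    )


# Invented sale prices and predictions, purely for illustration.
actual = [100000, 120000, 200000, 220000, 500000, 600000]
predicted = [110000, 125000, 198000, 223000, 430000, 520000]

result = mae_by_range(actual, predicted, bins=[0, 150000, 300000, 700000])
print(result)
```

In this toy data the lowest range has a positive mean error (overestimated) and the top range a large negative one (underestimated), mirroring the pattern described above.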

Again, this is unique to Kortical. This section is about taking the various actions that Kortical suggests. For this dataset it offers to remove the 12 columns that the insights surfaced as either not having enough data or having no bearing on the prediction. For other datasets it might remove leaking variables or same-case categorical features (where the same label appears multiple times, just with different capitalisation), among many more potential transformations surfaced by the insights. You can control which actions to take using the tick boxes near the insights; this section just lets you implement those actions automatically, leaving more time to focus on the more advanced machine learning features.
By offering automatic removal of columns that are redundant for machine learning and automatic detection of leaking variables, it cuts down dataset size at the outset, which takes time and effort out of data exploration and model training. From here the data can be saved or, with the full platform, you can create a model using our top-ranked AutoML.
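For comparison, this is roughly what two of those actions look like if you do them by hand in pandas: dropping a chosen set of columns and same-casing the labels of a categorical column. It is a hand-rolled sketch under my own assumptions, not the code Kortical runs; `apply_cleanup` and the toy data are hypothetical, and deciding which columns to pass in is the part the insights do for you.

```python
import pandas as pd


def apply_cleanup(df: pd.DataFrame, drop_columns, same_case_columns) -> pd.DataFrame:
    """Drop the selected columns and normalise label casing in the given columns."""
    cleaned = df.drop(columns=list(drop_columns))  # returns a new frame
    for col in same_case_columns:
        # "Pave", "pave" and "PAVE" all collapse to the single label "pave".
        cleaned[col] = cleaned[col].str.strip().str.lower()
    return cleaned


# Invented rows for illustration; PoolQC is empty and Street has mixed casing.
toy = pd.DataFrame({
    "Street": ["Pave", "pave", "PAVE", "Grvl"],
    "PoolQC": [None, None, None, None],
    "SalePrice": [150000, 180000, 220000, 120000],
})

cleaned = apply_cleanup(toy, drop_columns=["PoolQC"], same_case_columns=["Street"])
print(cleaned)
```

The original frame is left untouched, so you can compare before and after or back out of a change.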

With Kortical - ML Data Prep we’ve taken assisted data exploration and data preparation for machine learning to the next level, but we’re only getting started. Sign up for free today and let us know what you think of it.

