Paxata Community Members: Something special in a community experience is coming your way. Stay tuned to this space.
Exploratory Data Analysis: histograms to help you better understand and prep your data
Exploratory analysis is a critical piece of your data prep work. At a minimum, you want to perform some initial investigations to spot potential patterns and anomalies, across your entire dataset (not a sample), from which you can begin to form hypotheses. With Paxata, your exploratory work is made easier through our powerful Filtergrams tool, which produces histograms for your dataset's variables.
Filtergrams are also super useful for data
cleansing work after your initial exploration is complete. Filtergrams are used to identify and redress missing or incomplete values, and to flag unwanted observations in a dataset that can then be
removed using the Remove Rows tool.
For a deep dive on every angle of Filtergrams, of which there are many, check out our official Filtergrams documentation.
Let's explore a simple example here of how to open a Filtergram in Paxata, use it for exploratory analysis, and then cover a couple of quick examples of how to further clean based on what the histogram reveals.
Let's look at a dataset with medicare data--and just for convenience of quickly illustrating what kind of data we're looking at, I've used the columns tool to quickly peek at all of the variables in this dataset. You'll find the columns tool on the left side of the screen and it looks like this:
Quick ProTip: the columns tool is a super powerful tool that allows you to manipulate the feature variables in your data--for example removing ones you don't want to keep--but it's also a great little tool for quickly viewing all of the variables, and their types, in your dataset, which is much faster than scrolling across the dataset to view each one on the data grid. See Columns in a Paxata Project for a deep dive on the tool.
So here are the variables (columns) in the data we're going to explore:
In this example, I'm interested in "period" because I happen to know that it has data related to hospital admission and discharge dates, which tie into the average spend data. I want to better understand all of the values for that variable because I was surprised to see (in the image above) that it's a text type variable. I was expecting it be numeric--a number to indicate a period of time. Using a filtergram, I can quickly produce a list of every value used for the "period" variable.
To open the filtergram: first ensure you're looking at the data grid. Then click the down arrow for the variable's column and select "Filter values":
A historgram then opens and reveals that "period" has been coded in these fairly untidy four ways:
- "During Index Hospital Admission" = 1
- "1-3 days Prior to Index Hospital Admission" = 2
- Perform a Find and Replace for each of the text string instances and replace with the numeric assignment. The Find and Replace option is in the variable's drop-down menu--the same menu from which you opened the Filtergram.
ProTip: you can double-click in any cell, and the Find and Replace menu is automatically invoked.
- After you have all of the string values replaced with your numeric assignments, convert the column to type numeric. This is also done through the column menu--CHANGE into... numeric
With those two transformations complete, I open a Filtergram once more for the variable to ensure the resulting transformations are as I want them. And now I see not only more tidy values for my "period" variable, I also see that a new type of histogram is now displayed-- because this variable is now of type numeric.
And with that variable tidy, I can carry on with my prep work