
2. Explore Your Data

mikeblanko Posts: 10 mod
edited February 12, 2019 12:24AM in Getting Started

This is part 2 of the Getting Started tutorial. You just learned how to create a new project; now, learn how to explore your data. This series of videos and written tutorials teaches the basic steps of interactive data preparation with Paxata.

Video: Exploring Your Data


If you prefer reading to watching, the same steps from the video above are listed below.

Steps: Exploring Your Data

Topic Overview

Prior to applying transformations or filters to the data, you may have certain expectations of the data content and its quality. For example, the key attributes should be fully populated, clean, and standardized to allow for complete and accurate reporting. You may also need to confirm that there are no mismatched data types or unexpected outliers which could impact the processing in downstream systems.

Step by step

1) Let’s begin by highlighting patterns, spaces, and ranges within the data preview grid to get some at-a-glance insight.

We now see the following:

  • The Email column contains some NULL values
  • The City and State or Province columns appear in mixed case, which indicates this data is not fully standardized. We should review the population of distinct values to confirm consistency before performing any segmentation or further analysis on this dataset.
  • The Country values reveal another data consistency issue: trailing whitespace in some of the values, indicated by the gold bar at the right end of the cell. We will need to review the contents of this column further to assess the extent of the problem.
  • The numeric range highlighting gives a sense of the values in the Age column: 18 appears to be the minimum, and the maximum is somewhere below 80.
  • Lastly, the Age and Gender columns have a data sparsity issue: we see blank values and may need to take action by populating them or filtering them out.
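Paxata surfaces these cues directly in the data preview grid. For readers who want to reproduce the same at-a-glance checks outside the tool, here is a minimal sketch in Python with pandas; the sample rows are assumptions that mirror the issues described above, not the actual tutorial data:

```python
import pandas as pd

# Assumed sample data mirroring the issues described above (not the tutorial dataset).
df = pd.DataFrame({
    "Email":   ["a@example.com", None, "b@example.com"],
    "City":    ["Boston", "boston", "AUSTIN"],
    "Country": ["USA", "USA ", "Canada"],
    "Age":     [18, None, 79],
})

null_emails = df["Email"].isna().sum()                                # NULL values in Email
mixed_case = df["City"].nunique() > df["City"].str.lower().nunique()  # case-only variants?
trailing_ws = (df["Country"] != df["Country"].str.strip()).sum()      # trailing whitespace
age_min, age_max = df["Age"].min(), df["Age"].max()                   # numeric range
age_sparsity = df["Age"].isna().mean()                                # share of blank values

print(null_emails, mixed_case, trailing_ws, age_min, age_max, age_sparsity)
```

Each line corresponds to one of the grid observations: NULL highlighting, case standardization, whitespace highlighting, range highlighting, and sparsity.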

These initial insights give us an overall sense of the data quality issues and patterns we have at hand. To enable further analysis and a more complete understanding of the data we can generate a data filtergram on any column in this dataset. Filtergrams give you an interactive view of the entire data content by combining the power of filters for selecting subsets of your data with the intelligence of histograms for visualizing your data before, during, and after every transformation.

2) Open filtergrams on any column you’d like to profile and validate for data quality and consistency. To generate a filtergram, hover over the downward-facing triangle in the column header area and select “Filter values” from the column operations menu.

Given the insights we gained earlier, let’s take a closer look at the State or Province column, the Country column, and the Age column. Open a filtergram for each of those columns:

Looking at each of the filtergrams at the top of the screen, we now learn the following:

  • There are 70 unique values for State or Province represented in this dataset
  • The trailing whitespace issue results in what are technically 4 different Country values
  • Hovering over the Age filtergram reveals the actual minimum and maximum values
  • The proportion of the data quality bar that is grey, as compared to green, shows the extent of sparsity in this column
  • We can easily filter on the blank values, as well as on just those values that conform to the column's data type, which in this case is numeric
  • Any values represented as a different data type in Paxata would appear in red, and an additional button labeled “Other” would appear, which we could also filter in order to change the data type or remove the non-conforming values from our project workflow altogether

  • Furthermore, you are able to sort the values
  • swap the value count to a percentage
  • perform a full text search by clicking the magnifying glass

Filtering is useful for segmenting your data or for setting a pre-condition for a column transformation or computed column. Refer to the “Data Filters” article in the Help Shelf to learn more about the built-in capabilities of the different types of filtergrams.
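The blank / conforming / “Other” buckets that the Age filtergram exposes can be approximated outside Paxata as well. A minimal sketch in pandas, assuming a hypothetical Age column loaded as text:

```python
import pandas as pd

# Hypothetical Age values loaded as text (assumed sample, not the tutorial data).
ages = pd.Series(["18", "", "79", "forty", "42"])

numeric = pd.to_numeric(ages, errors="coerce")  # non-numeric values become NaN
blanks = ages == ""                             # the "blank" bucket
conforming = numeric.notna()                    # values matching the numeric data type
other = ~conforming & ~blanks                   # the "Other" bucket (wrong data type)

print(blanks.sum(), conforming.sum(), other.sum())
```

Selecting `ages[conforming]` would then mirror filtering down to only the type-conforming (green) values, while `ages[other]` isolates the values you might retype or remove.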

That completes the second of six tutorials in the Getting Started series.

Next: Clean and Transform Your Data

  1. Create a New Data Prep Project
  2. Explore Your Data
  3. Clean and Transform Your Data
  4. Publish and Export Your Data  
  5. Combine Your Data - Append
  6. Combine Your Data - Lookup / Join

