My categorical variables are messy and inconsistent. How do I normalize them?

Melanie (admin), Posts: 70

Often, you'll find that the values for your categorical variables are messy and inconsistent, for example "NY" versus "New York". But there's an easy way to clean this up using our Cluster and Edit algorithms. Here's a quick example of how you can do this in the app. For a complete understanding of our clustering algorithms, please refer to our documentation for the Cluster and Edit feature.

Example time
You receive a training dataset with a variable for state. Unfortunately, the values for the variable are inconsistent: "NewYorke", "Neu York", "CAlifornia", "California", etc.

Normalizing this column is super simple with the Cluster and Edit feature. From the variable's column, click the drop-down arrow and select "cluster + edit".

The Cluster+Edit menu opens. Notice that, by default, "metaphone" is the algorithm used for clustering. It works well for misspelled values that sound alike but are spelled differently. However, for this example, let's switch to the "fingerprint" algorithm. It's the most restrictive of the algorithms and a good place to start, because it only groups values whose differences are limited to punctuation, word order, and capitalization.
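If it helps to see the idea outside the app, here's a rough Python sketch of what a fingerprint-style key could look like. This is not the app's actual implementation. It only assumes what's described above: values land in the same cluster when they differ by nothing more than punctuation, word order and capitalization. The names fingerprint and cluster_by_fingerprint are made up for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Hypothetical key that ignores case, punctuation and word order."""
    cleaned = value.strip().lower()
    # Drop ASCII punctuation such as commas and periods.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the words so that word order no longer matters.
    return " ".join(sorted(cleaned.split()))


def cluster_by_fingerprint(values):
    """Group raw values by key; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return {key: vs for key, vs in groups.items() if len(set(vs)) > 1}


states = [
    "California", "CAlifornia", "california.",
    "New York", "York, New", "Neu York", "NewYorke",
]
clusters = cluster_by_fingerprint(states)
for key, members in clusters.items():
    print(key, "->", members)
```

In this sketch "Neu York" and "NewYorke" don't join any cluster, since they differ from "New York" by more than punctuation, word order or capitalization. Values like those are where a sound-based algorithm such as "metaphone" is the better fit.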

ProTip: if you're still exploring your data, start with the "fingerprint" algorithm. Make your edits. Then switch to the "metaphone" algorithm and cluster once more to see what other potential clusters exist in the data.

Now just enter the desired normalized values and click the green Save button.
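Conceptually, saving your edits amounts to replacing each messy value with the normalized one you typed in. Here's a small Python sketch of that idea. The mapping and the normalize_column helper are hypothetical and only for illustration; they are not how the app stores or applies edits.

```python
# Hypothetical mapping from messy values to the normalized values you entered.
normalized = {
    "NewYorke": "New York",
    "Neu York": "New York",
    "CAlifornia": "California",
}


def normalize_column(values, mapping):
    """Replace each value found in the mapping; leave all other values alone."""
    return [mapping.get(v, v) for v in values]


column = ["NewYorke", "Neu York", "CAlifornia", "California"]
cleaned_column = normalize_column(column, normalized)
print(cleaned_column)
```

Values that aren't in the mapping, like "California" here, pass through unchanged.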

You're done! And if you want to sanity check your work to see what else may be lurking in a variable's column, just open a Filtergram for that column. 
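If you want to run the same kind of sanity check outside the app, counting the distinct values in the column is one way to do it. The Python sketch below assumes that's roughly the view a Filtergram gives you, which is an assumption on my part, and the sample data and the value_counts helper are invented for illustration.

```python
from collections import Counter


def value_counts(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


# Invented sample column with one leftover misspelling.
column = ["New York", "New York", "California", "California", "Calfornia"]
counts = value_counts(column)

# Values that appear only once are often leftover typos worth a second look.
rare = [value for value, n in counts if n == 1]
print(counts)
print(rare)
```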
