This is a common data prep scenario and you can capture that Filtergram view in four easy Steps.
In this example, here's a Filtergram view of the Season 1 column:
To capture the view above in an AnswerSet:
1. Perform a Group By on the column, in this example the column named Season 1. The resulting new aggregated column becomes the Count – Show Name column:
2. Sort the new Count – Show Name column in descending order:
3. Now create a Lens for this view: Click the Lens tool, provide a Lens name, and Save the Lens to the Step:
4. Publish the Lens for export. Note that when you mouse over the Lens icon on the Step, the Publish button displays:
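Outside Paxata, the same group-and-count logic from steps 1 and 2 can be sketched in plain Python. This is purely illustrative — the rows and values below are hypothetical, and the column names are borrowed from the example above:

```python
from collections import Counter

# Hypothetical rows mimicking the example's Show Name / Season 1 columns
rows = [
    {"Show Name": "Alpha", "Season 1": "Drama"},
    {"Show Name": "Beta", "Season 1": "Comedy"},
    {"Show Name": "Gamma", "Season 1": "Drama"},
    {"Show Name": "Delta", "Season 1": "Drama"},
]

# Step 1: group by "Season 1" and count rows (the Count - Show Name column)
counts = Counter(row["Season 1"] for row in rows)

# Step 2: sort the counts in descending order
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
# ranked == [("Drama", 3), ("Comedy", 1)]
```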
By modernizing our look in the 2020.1 release, we continue to deliver on our commitment to product excellence while offering a slick integration experience for our DataRobot customers. And yep, there's even more goodness coming in this release too. For complete details of all our new features, be sure to check out the 2020.1 Release Notes.
Until then, happy new year and happy data prepping!
What is Data Preparation for Machine Learning?
Data preparation is the process of transforming raw data so that it's properly prepared for the machine learning algorithms used to uncover insights and make predictions.
Why is Data Preparation Important?
Most machine learning algorithms require data to be formatted in very specific ways, which means your raw datasets generally require some amount of preparation before they can yield useful insights. For example, some datasets have values that are missing or invalid. If data is missing, the algorithm can't use it. And if data is invalid, the algorithm produces less accurate or even misleading outcomes. Good data preparation produces clean and well-curated data that leads to more practical, accurate model outcomes.
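To make the missing-values point concrete, here is one common remedy — mean imputation — sketched in plain Python. This is a generic illustration, not Paxata-specific behavior, and the sample values are made up:

```python
# Impute missing values (None) with the column mean, a common prep step
values = [3.0, None, 5.0, None, 7.0]
present = [v for v in values if v is not None]
mean = sum(present) / len(present)
cleaned = [mean if v is None else v for v in values]
# cleaned == [3.0, 5.0, 5.0, 5.0, 7.0]
```

After imputation, every row carries a usable numeric value, so an algorithm that cannot handle gaps can consume the column.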
So what can I do to prep my data?
Paxata provides the transformation tools you need to clean, normalize, and shape your data. And once you've cleaned your data, Paxata also provides the tools you need to shape your features for optimal feature engineering.
Here's just a short list of how Paxata can help you to quickly prep your data to train your ML models:
- join datasets for feature enrichment
- add target variables to a training dataset
- normalize messy categorical variables (for example, NY versus New York, CA versus California)
- select only specific variables to save in a training dataset
- remove unwanted observations
- format dates so they are recognized by the training model
- identify and redress missing or incomplete values
- bin ranges into categories
- compute windowed aggregates (for example, rolling up sales transactions to a daily level)
- profile a dataset to understand your data prior to prep
- perform exploratory data analysis
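Two of the items above — normalizing messy categoricals and rolling sales up to a daily level — can be sketched in plain Python. This is an illustration of the transformations, not Paxata syntax; the lookup table and transactions are hypothetical:

```python
from collections import defaultdict
from datetime import datetime

# Normalize messy categorical values with a lookup table (hypothetical mapping)
canonical = {"NY": "New York", "new york": "New York", "CA": "California"}
raw = ["NY", "California", "new york", "CA"]
normalized = [canonical.get(v, v) for v in raw]
# normalized == ["New York", "California", "New York", "California"]

# Roll up transaction amounts to a daily level (a windowed aggregate)
transactions = [
    ("2020-01-01T09:30:00", 10.0),
    ("2020-01-01T14:00:00", 5.0),
    ("2020-01-02T11:15:00", 7.5),
]
daily = defaultdict(float)
for ts, amount in transactions:
    day = datetime.fromisoformat(ts).date().isoformat()
    daily[day] += amount
# daily == {"2020-01-01": 15.0, "2020-01-02": 7.5}
```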
Intelligent Automation, delivered through the Automatic Project Flows (APF) feature, lets you operationalize curated data flows. With a single click, APF computes the entire sequence of data prep Steps across Paxata Projects, datasets and AnswerSets to produce an end-to-end, automated output Flow for your data. You can set the Flow to run on a recurring time-based schedule, or run it just once to produce an end-result AnswerSet. All runs can then be easily managed through the APF Monitoring Interface. For complete details, see Automatic Project Flows.
What are the key benefits of APF over the Automation feature?
Business Analysts and Data Engineers can simplify complex data flows by breaking them into smaller groups of Paxata Projects that can be operationalized—with each Project focused on performing a related or cohesive set of Steps for improved readability and limited complexity. When you're finished creating your Projects, simply select the final Project in the sequence as your "target" Project. APF takes care of the rest—sequencing, preparing and automating the entire end-to-end flow without any manual stitching required.
For teams that require input from both business and IT, the data prep process can be simplified when members build Paxata Projects that depend on output AnswerSets created by others. Everyone completes their data prep work in their own Paxata Project, and then the entire sequence is operationalized from a single "target" Project. APF takes care of the rest with no manual stitching required, regardless of who created or owns the Projects and AnswerSets. Members of the team can then use the APF Monitoring Interface to view how their Projects and AnswerSets participate in the Flow's final output.
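Conceptually, sequencing a "target" Project and everything it depends on is a topological sort of the Project dependency graph. The sketch below is a generic illustration of that idea (Kahn-style ordering) — the Project names and dependency map are hypothetical, and this is not how APF is actually implemented internally:

```python
# Hypothetical map: Project -> Projects whose AnswerSets it consumes
deps = {
    "Clean_Sales": [],
    "Enrich_Customers": ["Clean_Sales"],
    "Target_Model_Input": ["Clean_Sales", "Enrich_Customers"],
}

def flow_order(target, deps):
    """Return an execution order ending at the target Project."""
    # Collect the target and all upstream Projects it depends on
    needed, stack = set(), [target]
    while stack:
        p = stack.pop()
        if p not in needed:
            needed.add(p)
            stack.extend(deps.get(p, []))
    # Repeatedly run any Project whose inputs are already done
    done, order = set(), []
    while len(order) < len(needed):
        ready = [p for p in sorted(needed - done)
                 if all(d in done for d in deps.get(p, []))]
        if not ready:
            raise ValueError("dependency cycle detected")
        for p in ready:
            done.add(p)
            order.append(p)
    return order
```

Given the map above, `flow_order("Target_Model_Input", deps)` runs `Clean_Sales` first, then `Enrich_Customers`, then the target — the same "no manual stitching" ordering the text describes.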
How do I migrate my legacy Automation job to APF?
The Paxata Customer Success team has built a utility that uses information about current Automation schedules and the list of Projects in a tenant to identify all of the "Target Projects" for which an APF needs to be created. Once the APFs are created, users can set up their schedules, custom import options, and export configurations.
Can a customer have both Legacy Automation and Intelligent Automation?
No. Automation and APF cannot co-exist in the same tenant.
I don't see the APF feature in my software. What am I missing?
APF is behind a feature flag. Contact Paxata Customer Success to enable the feature.
My Flow has a lot of input datasets and output projects. How do I see those details?
You can easily view metadata statistics for the dataset inputs by hovering your mouse over a dataset name in the DATASETS column. A pop-up window displays the dataset's version, its creation date, the user who added it to the Library, and the number of columns and rows. The same is true for Projects (Outputs). From the pop-up, you can click a button to navigate to the dataset or Project as desired. For complete details, see Automatic Project Flows.
Legacy Automation allows me to configure import options. What about APF?
All of the datasets that are identified as part of a Flow are displayed on the Inputs tab. If a dataset was previously imported from a Connector data source, there is an option to re-import the dataset. When that option is selected, you’re provided with the ability to configure the re-import options, just like Legacy Automation.
With Automation, I can export to 3rd Party Data Sources. What about APF?
Yes! Exporting Lens output to data sources works just as it does in Legacy Automation. The Outputs tab provides a list of all the output AnswerSets that are published from the Flow. Click the Configure Lens button for the lens and the Export panel opens at the bottom of the page. By default, AnswerSets are published to the Paxata Library. To also publish out to an external data source, click the drop-down for the Export Lens field and select "Library and Export". You can then specify the output location details and any export formatting options for that AnswerSet.
I have created an APF, but my Project script has changed since then. How do I update the Projects in a Flow?
You can update to the latest version of any of the Projects as long as the two Project versions have the same inputs and lenses. To update a Flow to use the latest version of all Projects, select "Update Projects" from the "Actions" drop-down on the Outputs tab; you are then prompted to confirm your selection. For complete details, see Automatic Project Flows.
My Project consumes an AnswerSet that it produces. Can I create a Flow for it?
Yes. A Project that consumes its own AnswerSet is called a self-loop, and this is supported. When such a Project is involved in the Flow, the AnswerSet that causes the loop is brought in as a special input to the Project and is depicted with a dotted line for the looped input.
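Detecting a self-loop is simply a matter of checking whether a Project's produced and consumed AnswerSets overlap. The snippet below illustrates that check with hypothetical data structures — it is not Paxata's internal representation:

```python
# Hypothetical maps of what each Project produces and consumes
produces = {"Running_Totals": ["totals_answerset"]}
consumes = {"Running_Totals": ["raw_sales", "totals_answerset"]}

# A self-loop exists where a Project consumes an AnswerSet it also produces
self_loops = [
    p for p in produces
    if set(produces[p]) & set(consumes.get(p, []))
]
# self_loops == ["Running_Totals"]
```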
Does APF have REST API support?
Yes. Please contact Paxata Customer Success for the REST API documentation.
Do existing job limits/quotas apply for APF?
Yes. In the APF world, each Dataset Import and Project Execution is considered a "chore". Before a chore executes, a check is performed to determine whether the group/tenant has sufficient quota available. If the quota has been exhausted, the chore does not run and the run of the Flow fails. Note: the quotas from the Legacy Automation feature also apply to APF.
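The per-chore quota gate described above can be sketched as follows. This is a hypothetical illustration of the check, not Paxata's actual API or accounting logic:

```python
# Sketch: run chores in order, failing the whole Flow run if quota runs out
def run_flow(chores, quota_remaining):
    completed = []
    for chore in chores:
        # Quota check happens before each chore executes
        if quota_remaining <= 0:
            return {"status": "failed", "completed": completed,
                    "reason": f"quota exhausted before {chore}"}
        quota_remaining -= 1
        completed.append(chore)
    return {"status": "succeeded", "completed": completed}
```

With a quota of 1 and two chores, the first chore completes and the run fails before the second — mirroring the behavior described above, where an exhausted quota fails the run.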
How do I find the Flows in which a Project participates?
When you visit a Project, an icon displays in the top right-hand corner of the header next to the "Create Project Flow" button. Click this icon to find all the Flows in which the Project participates. Please note that if you don't have sufficient permissions to view a particular Flow, the search results will exclude it.
How do I share a Project Flow with my team?
In the Project Flows list page, every Flow that you have permissions to share has a clickable Permission button. Click that button to adjust the permissions on the Flow. The page that opens works very much like permissions for Projects and datasets.
I have a Flow and there are many lenses in the Outputs tab that are enabled by default, and I am not able to disable them. Why?
If your lens has an indicator like the one displayed below, it means that the output from the lens is used by a Project in the Flow. Disabling such a lens would cause a failure when the Flow runs, so users are prevented from disabling these required lenses. You can, however, decide whether to publish the output to just the Library or to also export it to a Data Source.
What happens if, during the run of a Flow, one of the imports or Projects fail?
If any chore (Dataset Import or Project Execution) fails during the run of a Flow, the entire run fails. You can see the chore status in the Run Details tab. Any chore that failed will have a “display errors and warnings” link, which, when clicked, opens a panel to display the errors encountered during execution.
Can I create more than one Flow for a Project?
Yes. However, best practice is to first determine whether a Project is already participating in other Flows. If so, be clear about which version of the Project each Flow is using.
How do I identify if a dataset/AnswerSet was produced as part of an APF Run?
In the Library, every dataset (or AnswerSet) now has a "Go to Run" menu item. If the dataset was produced by a Project Flow, the button is enabled, and clicking it takes you to the Run Details tab of the Run that produced the dataset.
Can I cancel the run of a Flow?
Yes, you can. While a run is in progress, visit the Run Details page for that run and click the "Stop" button. This puts the run into cancel mode and a confirmation message is displayed. Any chores currently in progress will complete, and then the run will halt.
My Flow just started running with wrong configurations. What can I do now?
You can cancel the run of the Flow by clicking the "Stop" button on the Run Details page. Once the UI acknowledges that a cancel request has been made, you can run the Flow again.