
Paxata Basic Installation Steps for Development Environment

The following are basic Paxata installation steps suitable for POC, test, and other non-production environments.

Specification of this installation:

Six worker cores for pipeline use
25GB Free RAM per node
100GB Free Disk Space per node
Local FileSystem as library storage
Local MongoDB as metadata storage
Standalone Apache Spark cluster with 2 Spark workers
HTTPS Jetty Web Server (bundled with Paxata Server)
No security / Kerberos


Recommended host assignment: 
Host1: Paxata Server, MongoDB, JDK7 
Host2: Paxata Pipeline, Spark Master, JDK7 
Host3: Spark Worker1, JDK7 
Host4: Spark Worker2, JDK7 
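
The sizing above can be sanity-checked with a short shell sketch; the per-worker values (3 cores, 24g memory) come from the spark-env.sh configured later in this guide:

```shell
# Sanity-check the cluster sizing: 2 Spark workers at 3 cores each
# should add up to the "six worker cores for pipeline use" in the spec.
WORKERS=2
CORES_PER_WORKER=3
MEM_PER_WORKER_GB=24
TOTAL_CORES=$((WORKERS * CORES_PER_WORKER))
TOTAL_MEM_GB=$((WORKERS * MEM_PER_WORKER_GB))
echo "total worker cores: ${TOTAL_CORES}"
echo "total worker memory: ${TOTAL_MEM_GB}g"
```

Note that 24g of worker memory per node leaves roughly 1GB of the 25GB free-RAM requirement for the Spark daemon (SPARK_DAEMON_MEMORY=256m) and OS overhead.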

1. Java

Step 1A: Download Oracle JDK 7u79 (skip if Oracle JDK 7 or 8 is already installed; OpenJDK is not supported):

on host1, host2, host3 and host4 (make sure the same single JDK version is installed on all hosts; do not mix JDK versions across hosts):

sudo curl -v -j -k -L -H "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-x64.rpm -o /tmp/jdk-7u79-linux-x64.rpm

Step 1B: Install Java (if Oracle JDK7/8 is not already installed): 

on host1, host2, host3 and host4:

sudo yum localinstall /tmp/jdk-7u79-linux-x64.rpm


2. Spark

Step 2A: Download Apache Spark (1.6.1)
on host2, host3 and host4
sudo wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz

Step 2B: Configure Spark on the master and worker nodes. 
All nodes should have an identical spark-env.sh (in this example, host2 is the Spark master hostname, and each Spark worker provides 3 worker cores and 24g of worker memory, for 6 cores total across the 2 workers)

on host2, host3 and host4: 

sudo su
mkdir -p /usr/local/paxata/tmp; mkdir -p /usr/local/paxata/pid; mkdir -p /usr/local/paxata/work; mkdir -p /usr/local/paxata/log; mkdir -p /usr/local/paxata/cache; chmod 777 -R /usr/local/paxata/*;
mkdir -p /usr/local/paxata/spark; tar -xvzf spark-1.6.1-bin-hadoop2.6.tgz -C /usr/local/paxata/; mv /usr/local/paxata/spark-1.6.1-bin-hadoop2.6/* /usr/local/paxata/spark/; vi /usr/local/paxata/spark/conf/spark-env.sh


#!/usr/bin/env bash
### Change the following to specify a real cluster's Master host
export STANDALONE_SPARK_MASTER_HOST=host2
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/usr/local/paxata/work
export SPARK_LOG_DIR=/usr/local/paxata/log
export SPARK_PID_DIR=/usr/local/paxata/pid
export SPARK_LOCAL_DIRS=/usr/local/paxata/tmp
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_CORES=3
export SPARK_WORKER_MEMORY=24g
export SPARK_DAEMON_MEMORY=256m
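
Since all three Spark hosts must carry identical spark-env.sh files, it can help to source the file in a subshell and echo the sizing it exports. Below is a minimal, self-contained sketch using a temporary abridged copy; on a real host you would source /usr/local/paxata/spark/conf/spark-env.sh directly:

```shell
# Sketch: source a spark-env.sh-style file in a subshell and print the
# worker sizing it exports. The temporary abridged copy below stands in
# for /usr/local/paxata/spark/conf/spark-env.sh on a real host.
TMP_ENV=$(mktemp)
cat > "$TMP_ENV" <<'EOF'
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_CORES=3
export SPARK_WORKER_MEMORY=24g
EOF
( . "$TMP_ENV"
  echo "instances=${SPARK_WORKER_INSTANCES} cores=${SPARK_WORKER_CORES} memory=${SPARK_WORKER_MEMORY}" )
```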


Step 2C: start master on master node: 

on host2: 
sudo /usr/local/paxata/spark/sbin/start-master.sh

Step 2D: start slave on each worker node: 

on host3: 
sudo /usr/local/paxata/spark/sbin/start-slave.sh spark://host2:7077

on host4:
sudo /usr/local/paxata/spark/sbin/start-slave.sh spark://host2:7077


3. MongoDB

Step 3A: Download MongoDB 3.2.11 (RHEL/CentOS 6.x or 7.x)

on host1:
(RHEL/CentOS 6.x only)
wget http://repo.mongodb.org/yum/redhat/6/mongodb-org/3.2/x86_64/RPMS/mongodb-org-3.2.11-1.el6.x86_64.rpm;
wget http://repo.mongodb.org/yum/redhat/6/mongodb-org/3.2/x86_64/RPMS/mongodb-org-mongos-3.2.11-1.el6.x86_64.rpm;
wget http://repo.mongodb.org/yum/redhat/6/mongodb-org/3.2/x86_64/RPMS/mongodb-org-server-3.2.11-1.el6.x86_64.rpm;
wget http://repo.mongodb.org/yum/redhat/6/mongodb-org/3.2/x86_64/RPMS/mongodb-org-shell-3.2.11-1.el6.x86_64.rpm;
wget http://repo.mongodb.org/yum/redhat/6/mongodb-org/3.2/x86_64/RPMS/mongodb-org-tools-3.2.11-1.el6.x86_64.rpm;

(RHEL/CentOS 7.x only)
wget http://repo.mongodb.org/yum/redhat/7/mongodb-org/3.2/x86_64/RPMS/mongodb-org-3.2.11-1.el7.x86_64.rpm;
wget http://repo.mongodb.org/yum/redhat/7/mongodb-org/3.2/x86_64/RPMS/mongodb-org-mongos-3.2.11-1.el7.x86_64.rpm;
wget http://repo.mongodb.org/yum/redhat/7/mongodb-org/3.2/x86_64/RPMS/mongodb-org-server-3.2.11-1.el7.x86_64.rpm;
wget http://repo.mongodb.org/yum/redhat/7/mongodb-org/3.2/x86_64/RPMS/mongodb-org-shell-3.2.11-1.el7.x86_64.rpm;
wget http://repo.mongodb.org/yum/redhat/7/mongodb-org/3.2/x86_64/RPMS/mongodb-org-tools-3.2.11-1.el7.x86_64.rpm;

Step 3B: Install Mongo:
on host1:
sudo yum localinstall --disablerepo='*' mongodb*.rpm

Step 3C: Configure Mongo:
on host1:
sudo mkdir -p /usr/local/paxata/mongo; sudo chown mongod:mongod /usr/local/paxata/mongo;

sudo vi /etc/mongod.conf

storage:
  dbPath: /usr/local/paxata/mongo
  journal:
    enabled: true
  engine: wiredTiger
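
Since the Paxata Server in this layout runs on host1 alongside MongoDB, the packaged defaults (listening on localhost, port 27017) are sufficient. If you want to make them explicit, a net section can be added to /etc/mongod.conf; this is an optional sketch, not a required step:

```yaml
# Optional: make the default bind address and port explicit.
# Only loosen bindIp if a remote host must reach MongoDB.
net:
  port: 27017
  bindIp: 127.0.0.1
```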


Step 3D: start MongoDB:
on host1:
(RHEL/CentOS 6.x only)
sudo service mongod restart
sudo chkconfig mongod on

(RHEL/CentOS 7.x only)
sudo systemctl restart mongod
sudo systemctl enable mongod.service

4. Paxata Pipeline
Step 4A: Download Paxata Pipeline
on host2 (replace xxxxx and 2.17.x with actual value):

wget --user xxxxx --password xxxxx --no-check-certificate https://flash.paxata.com/2.17/paxata-pipeline-db1.6.1-2.17.x.rpm

Step 4B: Install Paxata Pipeline

on host2 (replace 2.17.x with actual value):
sudo yum localinstall paxata-pipeline-db1.6.1-2.17.x.rpm

Step 4C: Configure Pipeline (in this example, the pipeline uses 15g executor memory, matching px.executor.memory below, and 20g disk space per worker)
on host2: 
cd /usr/local/paxata/pipeline/config/

sudo vi spark.properties

##Valid config options are spark://master:port or yarn-client
master.url=spark://host2:7077
spark.home=/usr/local/paxata/spark

##The config variables below are only required for Spark on YARN
# hadoop.conf=/etc/hadoop/conf
# yarn.num.executors=5
# yarn.executor.cores=1
# spark.yarn.jar=hdfs://paxcdh54yik/user/spark/share/lib/spark-assembly-1.3.0-cdh5.4.8-hadoop2.6.0-cdh5.4.8.jar
# spark.yarn.queue=spark

sudo vi paxata.properties

px.rootdir=/usr/local/paxata/cache
px.total.cache.capacity=75000
# The following are used by the pipeline startup script
px.xmx=4g
px.xms=1g
px.xx.MaxPermSize=256m
px.executor.memory=15g
px.partition.maxBytes=100000000
px.ulimit.min=1024
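
px.ulimit.min appears to correspond to the minimum open-file limit the pipeline expects (an assumption; confirm against the Paxata documentation). A quick way to check the current limit on host2:

```shell
# Sketch: compare this shell's open-file limit against the assumed
# px.ulimit.min value of 1024 from paxata.properties above.
REQUIRED=1024
CURRENT=$(ulimit -n)
if [ "$CURRENT" = "unlimited" ] || [ "$CURRENT" -ge "$REQUIRED" ]; then
    echo "open-file limit ok: ${CURRENT}"
else
    echo "open-file limit too low: ${CURRENT} < ${REQUIRED}"
fi
```

If the limit is too low, it can typically be raised via nofile entries in /etc/security/limits.conf.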

Step 4D: Start Pipeline

on host2: 

sudo service paxata-pipeline restart; tail -f /usr/local/paxata/pipeline/logs/pipeline.log

5. Paxata Core Server

Step 5A: Download Paxata Server
on host1 (replace xxxxx and 2.17.x with actual value):

wget --user xxxxx --password xxxxx --no-check-certificate https://flash.paxata.com/xxxxx/2.17/paxata-server-2.17.x.rpm

Step 5B: Install Paxata Server 
on host1 (replace xxxxx and 2.17.x with actual value):
sudo yum localinstall --disablerepo='*' paxata-server-2.17.x.rpm

Step 5C: Configure Paxata Server

on host1: 

cd /usr/local/paxata/server/config

sudo vi jetty.properties
px.port=80
px.port.redirect=true
px.port.redirect.to=443
px.use.ssl=true
px.ssl.port=443
px.ssl.cert.alias=paxata
px.ssl.keystore=/usr/local/paxata/server/px.jks
px.ssl.keystore.password=paxata
px.ssl.truststore=/usr/local/paxata/server/px.jks
px.ssl.truststore.password=paxata

sudo vi px.properties
# Set this to the hostname of the Paxata Pipeline server
px.pipeline.url=http://host2:8090

# Set this to the hostname of the Paxata Server
px.library.url=http://host1:9080/library

Step 5D: Create Java KeyStore File with Self Signed SSL Certificate

on host1: 
sudo /usr/java/latest/bin/keytool -genkey -alias paxata -keyalg RSA -keysize 2048 -keystore /usr/local/paxata/server/px.jks -dname "CN=paxata,OU=IT,O=IT Admin,L=Redwood City,ST=CA,C=US"

The keystore password you choose must match px.ssl.keystore.password and px.ssl.truststore.password in /usr/local/paxata/server/config/jetty.properties; in this example it is paxata.

sudo chown paxata:paxata /usr/local/paxata/server/px.jks

Step 5E: Allow privileged port (80 & 443) to be used by Jetty Server

on host1: 

sudo setcap 'cap_net_bind_service=+ep' /usr/java/latest/bin/java
echo '/usr/java/latest/lib/amd64/jli' | sudo tee /etc/ld.so.conf.d/java.conf
sudo ldconfig

Step 5F: Start Paxata Server

on host1:
sudo service paxata-server restart; tail -f /usr/local/paxata/server/logs/frontend.log

Watch frontend.log and pipeline.log for any errors during startup. If none appear, you should be able to access the Paxata UI in your browser at

https://host1 

Log in as superuser/superuser (the default username/password) and click About in the top-right dropdown. You should see both the Paxata Server and Pipeline versions. 

Now load your favorite dataset into a new project and start data preparation in the development environment!

For production environment deployment, please contact [email protected] for more info.