{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning (advanced): the Titanic dataset\n",
"\n",
"If you want to try out this notebook with a live Python kernel, use mybinder:\n",
"\n",
"\n",
"\n",
"\n",
"In the following is a more involved machine learning example, in which we will use a larger variety of method in `veax` to do data cleaning, feature engineering, pre-processing and finally to train a couple of models. To do this, we will use the well known _Titanic dataset_. Our task is to predict which passengers are more likely to have survived the disaster. \n",
"\n",
"Before we begin, thare there are two important notes to consider:\n",
" - The following example is not to provide a competitive score for any competitions that might use the _Titanic dataset_. It's primary goal is to show how various methods provided by `vaex` and `vaex.ml` can be used to clean data, create new features, and do general data manipulations in a machine learning context. \n",
" - While the _Titanic dataset_ is rather small in side, all the methods and operations presented in the solution below will work on a dataset of arbitrary size, as long as it fits on the hard-drive of your machine.\n",
" \n",
"Now, with that out of the way, let's get started!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:37.005009Z",
"start_time": "2020-05-01T17:12:35.667407Z"
}
},
"outputs": [],
"source": [
"import vaex\n",
"import vaex.ml\n",
"\n",
"import numpy as np\n",
"import pylab as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adjusting `matplotlib` parmeters\n",
"\n",
"_Intermezzo:_ we modify some of the `matplotlib` default settings, just to make the plots a bit more legible."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:37.014957Z",
"start_time": "2020-05-01T17:12:37.007951Z"
}
},
"outputs": [],
"source": [
"SMALL_SIZE = 12\n",
"MEDIUM_SIZE = 14\n",
"BIGGER_SIZE = 16\n",
"\n",
"plt.rc('font', size=SMALL_SIZE) # controls default text sizes\n",
"plt.rc('axes', titlesize=SMALL_SIZE) # fontsize of the axes title\n",
"plt.rc('axes', labelsize=MEDIUM_SIZE) # fontsize of the x and y labels\n",
"plt.rc('xtick', labelsize=SMALL_SIZE) # fontsize of the tick labels\n",
"plt.rc('ytick', labelsize=SMALL_SIZE) # fontsize of the tick labels\n",
"plt.rc('legend', fontsize=SMALL_SIZE) # legend fontsize\n",
"plt.rc('figure', titlesize=BIGGER_SIZE) # fontsize of the figure title"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First of all we need to read in the data. Since the _Titanic dataset_ is quite well known for trying out different classification algorithms, as well as commonly used as a teaching tool for aspiring data scientists, it ships (no pun intended) together with `vaex.ml`. So let's read it in, see the description of its contents, and get a preview of the data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:37.069863Z",
"start_time": "2020-05-01T17:12:37.017532Z"
}
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"
Description: file exported by vaex, by user jovan, on date 2019-07-04 11:02:26.996867, from source /has/no/path/pandasprevious description:\n",
"\n",
"The Titanic dataset. \n",
"A classic dataset used in many data mining tutorials and demos. \n",
"Perfect for exploratory analysis and building binary classification models to predict survival.\n",
"\n",
"Data covers passengers only, not crew.\n",
"\n",
"Column description:\n",
"pclass = passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)\n",
"survived = Survival (False = No; True = Yes)\n",
"name = Name\n",
"sex = Sex\n",
"sibsp = Number of Siblings/Spouses Aboard\n",
"parch = Number of Parents/Children Aboard\n",
"ticket = Ticket Number\n",
"fare = Passenger Fare\n",
"cabin = Cabin\n",
"embarked = Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
"boat = Lifeboat (if survived)\n",
"body = Body number (if did not survive and body was recovered)\n",
"home_dest = Passenger destination\n",
"
Columns:
column
type
unit
description
expression
pclass
int64
survived
bool
name
str
sex
str
age
float64
sibsp
int64
parch
int64
ticket
str
fare
float64
cabin
str
embarked
str
boat
str
body
float64
home_dest
str
Data:
\n",
"\n",
"
#
pclass
survived
name
sex
age
sibsp
parch
ticket
fare
cabin
embarked
boat
body
home_dest
\n",
"\n",
"\n",
"
0
1
True
Allen, Miss. Elisabeth Walton
female
29.0
0
0
24160
211.3375
B5
S
2
nan
St Louis, MO
\n",
"
1
1
True
Allison, Master. Hudson Trevor
male
0.9167
1
2
113781
151.55
C22 C26
S
11
nan
Montreal, PQ / Chesterville, ON
\n",
"
2
1
False
Allison, Miss. Helen Loraine
female
2.0
1
2
113781
151.55
C22 C26
S
None
nan
Montreal, PQ / Chesterville, ON
\n",
"
3
1
False
Allison, Mr. Hudson Joshua Creighton
male
30.0
1
2
113781
151.55
C22 C26
S
None
135.0
Montreal, PQ / Chesterville, ON
\n",
"
4
1
False
Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
female
25.0
1
2
113781
151.55
C22 C26
S
None
nan
Montreal, PQ / Chesterville, ON
\n",
"
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
\n",
"
1,304
3
False
Zabour, Miss. Hileni
female
14.5
1
0
2665
14.4542
None
C
None
328.0
None
\n",
"
1,305
3
False
Zabour, Miss. Thamine
female
nan
1
0
2665
14.4542
None
C
None
nan
None
\n",
"
1,306
3
False
Zakarian, Mr. Mapriededer
male
26.5
0
0
2656
7.225
None
C
None
304.0
None
\n",
"
1,307
3
False
Zakarian, Mr. Ortin
male
27.0
0
0
2670
7.225
None
C
None
nan
None
\n",
"
1,308
3
False
Zimmerman, Mr. Leo
male
29.0
0
0
315082
7.875
None
S
None
nan
None
\n",
"\n",
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Load the titanic dataset\n",
"df = vaex.ml.datasets.load_titanic()\n",
"\n",
"# See the description\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Shuffling\n",
"From the preview of the DataFrame we notice that the data is sorted alphabetically by name and by passenger class.\n",
"Thus we need to shuffle it before we split it into train and test sets."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:37.078118Z",
"start_time": "2020-05-01T17:12:37.072165Z"
}
},
"outputs": [],
"source": [
"# The dataset is ordered, so let's shuffle it\n",
"df = df.sample(frac=1, random_state=31)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Shuffling for large datasets\n",
"As mentioned in [The ML introduction tutorial](tutorial_ml_intro.ipynb), shuffling large datasets in-memory is not a good idea. In case you work with a large dataset, consider shuffling while exporting:\n",
"\n",
"```\n",
"df.export(\"shuffled\", shuffle=True)\n",
"df = vaex.open(\"shuffled.hdf5)\n",
"df_train, df_test = df.ml.train_test_split(test_size=0.2)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split into train and test\n",
"Once the data is shuffled, let's split it into train and test sets. The test set will comprise 20% of the data. Note that we do not shuffle the data for you, since vaex cannot assume your data fits into memory, you are responsible for either writing it in shuffled order on disk, or shuffle it in memory (the previous step)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:37.128176Z",
"start_time": "2020-05-01T17:12:37.080094Z"
}
},
"outputs": [],
"source": [
"# Train and test split, no shuffling occurs\n",
"df_train, df_test = df.ml.train_test_split(test_size=0.2, verbose=False)"
]
},
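{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the shuffle has to come before the split, here is a minimal sketch in plain NumPy. It does not use `vaex`: the array `pclass`, the helper `head_split`, and the assumption that the test set is a contiguous block taken from the start of the table are all hypothetical and chosen only for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical stand-in for a sorted table: 1000 rows ordered by passenger class.\n",
"pclass = np.repeat([1, 2, 3], [250, 250, 500])\n",
"\n",
"def head_split(values, test_size=0.2):\n",
"    # Split without shuffling: the first test_size fraction becomes the test set.\n",
"    n_test = int(round(len(values) * test_size))\n",
"    return values[n_test:], values[:n_test]\n",
"\n",
"# Without shuffling, the test block contains only first-class rows.\n",
"_, test_sorted = head_split(pclass)\n",
"frac_sorted = np.mean(test_sorted == 1)\n",
"\n",
"# After a seeded shuffle, the first-class share is close to the overall 25%.\n",
"rng = np.random.default_rng(31)\n",
"_, test_shuffled = head_split(rng.permutation(pclass))\n",
"frac_shuffled = np.mean(test_shuffled == 1)\n",
"\n",
"print(f'1st-class share in test set, sorted:   {frac_sorted:.2f}')\n",
"print(f'1st-class share in test set, shuffled: {frac_shuffled:.2f}')\n",
"```\n",
"\n",
"On the sorted array the test block holds a single class, while after the seeded shuffle its class mix is close to that of the full table."
]
},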
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sanity checks\n",
"\n",
"Before we move on to process the data, let's verify that our train and test sets are \"similar\" enough. We will not be very rigorous here, but just look at basic statistics of some of the key features.\n",
"\n",
"For starters, let's check that the fraction of survivals is similar between the train and test sets."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:37.731294Z",
"start_time": "2020-05-01T17:12:37.129879Z"
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1QAAAEUCAYAAAAspncYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3deZhkZXn///eHGQRkGBEYMKCAuKEouIzBJaJGI1+3SIBvBFExiWL0i8tPElzRcSEKCkbFCLgBLgRRMCoaxQhuUZJxARwEBQRlUQeQgRkWWe7fH+e0FEV3z+mmuqt6+v26rrqmzvOc5a6emXr6PudZUlVIkiRJkqZuvWEHIEmSJElzlQmVJEmSJE2TCZUkSZIkTZMJlSRJkiRNkwmVJEmSJE2TCZUkSZIkTZMJleaMJF9Lsv+w4xgFSfZL8o2O+65I8pRpXueSJE+fzrEdz9/5c0iSZt9Mtb1JjkvyrkGft+f82yZZnWTBTF1DGmNCpRnVfpmNvW5PcmPP9n5TOVdVPbOqjp+pWPsl2T5JJVk4W9fsqqo+U1XP6LjvTlV15gyHtFbj/Tyn8jk6nH+zJKcmWZPk0iQvWMv+OyT5SpLrk1yV5PDpnkvS3DPI9qk935lJXjoDcb4kyfcGfd6uZrvtna7+G4BV9euqWlRVtw3g3ElyWJKr29fhSTLBvvv1/du6oW37HjPVc2nuGLlfFLVuqapFY++TXAK8tKq+2b9fkoVVdetsxjbK5uLPI8mCQTRcd8OHgT8CWwGPBE5LcnZVrejfMck9gNPbY54P3AY8eDrnkjQ3dW2f5qv2l/xU1e0jEMuw28QDgD2AXYCiaT8uBo7u37GqPgN8Zmw7yUuAQ4AfT/Vcmjt8QqWhSPKUJJcleX2S3wKfTHLv9onByiR/aN/ft+eYP939G7tjl+R97b6/SvLMSa73+iSXt08jLkjytLZ8vSRvSHJRe6foc0k2aw/7Tvvnte1dpsd3+Fx/nmR5kuuS/C7Jkb2ft2/fP91NS7IsyeeTfDrJdcCb2rulm/Xs/6j2Scr6vXcskxyd5H195/6PJK8b5zqTfV6SvKh9InN1kjev5bMel+QjSb6aZA3w1CTPTvKT9vP/JsmynkPu8vPsv/Oa5AlJ/jfJqvbPJ6ztZ94etzGwF3BIVa2uqu8BXwJeNMEhLwGuqKojq2pNVd1UVedM81yS1iGTfU8m2bD9nr46ybXt99RWSQ4FngQc1X6/HTXOecc9tq27V5KPJ7mybavelWRBkofS/KL9+Pa813b8DC9JcnHb5v0q7RO3tq35dM9+d+o50Lazhyb5PnADsMNY25tkgzbuh/ccv6Rtq7Zst5+T5Kftfv+dZOeefR+V5MdtTCcBG64l/u8neX+Sa4BlSR6Q5Fvtz++qJJ9Jsmm7/6eAbYEvtz+ng8f5bFsn+VKSa5JcmORlXX6Wrf2BI6rqsqq6HDiCph3peuwJVVUDOJdGlAmVhuk+wGbAdjR3bNYDPtlubwvcCNylUeqxK3ABsAVwOPDx5K6PzZM8BDgQeGxVbQLsDlzSVr+a5k7Rk4GtgT/QPJ0A2K39c9O228AP0vTJvjbJthPE9AHgA1W1GHgA8LlJfwJ39jzg88CmwHuBH9D8Yj/mBcDnq+qWvuM+Czx/7LMnuTfwDODfx7nGhJ83ycOAj9AkDlsDmwP3HeccvV4AHApsAnwPWAO8uP0MzwZekWSPdt+7/Dx7T9T+wnIa8MH22kfSPBnavK1/Q5KvTBDHg4HbquoXPWVnAztNsP/jgEvSjA24qv2F4RHTPJekdctk7cL+wL2A+9F8T/0jcGNVvRn4LnBg+/124DjnHffYtu544FbggcCjaL7DX1pVP2/3+0F73rEE4gVJzhkv+Pam0AeBZ7Zt3hOAn07h87+Ipk3eBLh0rLCqbgZOAfbt2fdvgW9X1e+TPBr4BPDy9vMdA3ypTcTuAXwR+BRNu38yd2
7fxrMrzZObLWnamQDvpvk7eSjNz3FZG9uLgF8Dz21/ToePc74Tgcva4/cG/iV33Fz9i7UkqzvRtANjOrUJSbajaftOuLvn0mgzodIw3Q68rapurqobq+rqqvpCVd1QVdfTfIE+eZLjL62qj7bdzI4H/oymi1a/24ANgIclWb+qLqmqi9q6lwNvbu8U3Uzz5bx3Jhg31fbJ3rSqfj1BTLcAD0yyRft044dr+Rn0+kFVfbGqbq+qG2kSpX3hT10v9mnL+n2XptvAk9rtvdtzXTHOvpN93r2Br1TVd9q6Q2j+jibzH1X1/Tbmm6rqzKo6t90+h6YBm+zvsNezgV9W1aeq6taqOhE4H3guQFW9p6qeM8Gxi4BVfWWraH4hGM99aX6eH6RpXE8D/qNt9Kd6Lknrlsm+J2+hSRYeWFW3VdWPquq6jucd99j2KdUzgde2T8x/D7yf5jtqXFX12araeaJ6mu/uhyfZqKqunGJ35eOqakX7PTzeDbzehOoF3NEuvQw4pqrOaj/f8cDNNDewHgesD/xrVd1SVZ8H/nctcVxRVR9q47ixqi6sqtPb3xlW0tx069S+JLkf8BfA69u26qfAx2h7HlTV98aS1Qn0twurgEXj3cTt82Lgu1X1qwGcSyPMhErDtLKqbhrbSHLPJMek6XJ2HU0XsU0z8Qw9vx17U1U3tG8X9e9UVRcCr6VpFH+f5N+TbN1Wbwec2j51uhb4OU0CNl5i1sU/0DzhOL/tzjFRAjCe3/Rtf56mm8fWNHe4iiZ5upO2G8G/c0cj9wJ6+m/3mezzbt0bQ1WtAa6eSsxJdk1yRppum6to7qxusZZzjNmanruhrUuBbTocuxpY3Fe2GLh+gv1vBL5XVV+rqj8C76P5Reeh0ziXpHXLZN+TnwK+Dvx7kivSTCiwfsfzTnTsdjTJxpU91zyG5snMlLXf3c+n+f69MslpSXacwin626Je3wI2ar/rt6MZY3pqW7cdcNDYZ2g/x/1ovtu3Bi7v6fYGd/2+nzSOJFu27ffl7e8In2Zq7cs17c3a3ut3aV/gru3CYmB13+cZz4tpbvgO4lwaYSZUGqb+L4+DgIcAu7Zd5sa6iN3tuzbt3by/oPnCL+Cwtuo3NN0iNu15bdj2a57yl1tV/bKq9qVpCA8DPt92v1gD3HNsvzZJXNJ/eN+5rgW+QdOl4gXAiZN84Z5Icwd1O5puEl+YYL/JPu+VNI3fWIz3pEkyJv3IfdufpRlvdL+quhdN3/9MsG+/K2j+fnptC1y+luMAfgEsTPKgnrJdgInuyp4zSTxTPZekdcuE35Pt05W3V9XDaLrSPYfml2ZYy3fcJMf+huZJzhY911tcVWPdwKbTFn29qv6KpufG+cBH26o7tUU0Xe/vcvgk572dpiv7vjTt0ld6kpTfAIf2/dzu2fY2uBLYpu8pzERd5yeK491t2c7t7wgv5M6/H0z2c7oC2CxJb0+Dru0LNN//u/Rsr7VNSPJEmkTu83f3XBp9JlQaJZvQPDm4th1P87ZBnDTJQ5L8ZZINgJvaa4zNRnc0cGibiIwNsH1eW7eSptvEDlO41guTLGkbnbH+2LfR/JK+YZpJG9YH3kLTDXFtPkvT4O7F+N39AKiqn7Txfgz4epuMjWeyz/t54DltX/J7AO9g6t8Rm9DcBbwpyZ/TNLhj1vbz/Crw4HZswMIkzwceBkw0bupP2juypwDvSLJx25A9j+aO8Hg+DTwuydPb5Pa1wFXAz6dxLknrlgm/J5M8Nckj2u+N62i68Y21J79jkvZiomOr6kqam2dHJFmcZlKMByQZ6872O+C+7ffyWqWZJOOv25t5N9M8ERmL8afAbmnGA98LeGPHn0mvz9I8AduPO7dLHwX+sX16lfb789ltEvMDmjFir26/3/cE/nyK192k/SzXJtkG+Oe++gl//lX1G+C/gXenmRxkZ5oeJRP15uh3AvC6JNu0vUYOAo5byzH7A1/oeyo23XNpxJlQaZT8K7ARzS
+2PwT+c0Dn3QB4T3ve39I8PXpTW/cBmicq30hyfXvdXeFP3QgPBb7fdl94XO5YKHCiO2v/B1iRZHV77n3a/tqrgFfSJDyX09wlvGyCc/T6EvAg4HdVdfZa9j0ReDqTJF5r+bwrgP/XHn8lzUDsLjH2eiVNInI98FZ6JuUY7+fZe2BVXU1zx/Ygmq6GBwPPqaqrAJK8KcnX1nLtjYDf0/wsXjE2bqD/762qLqC5u3l0+zmfB/x12/1v0nNJWudN+D1J80Tn8zQJ0c+Bb9PcoBk7bu80M89+cJzzTnbsi4F7AOfRfCd9nubpEjTd7FYAv00y9n24X5KJvpPWo/kevQK4hmac0SsBqup04CSap/Q/osMNq35VdRZNG7Y18LWe8uU046iOaj/DhbSz17XfrXu223+gSchOmeKl3w48mmbM0WnjHP9u4C1t+/JP4xy/L7A9zc/lVJox3KcDJHlS225P5Bjgy8C5wM/a6x8zVplkRXrWLkuyIU3vkvHW75r0XJqbYpdNSZIkSZoen1BJkiRJ0jSZUEmSJEnSNJlQSZIkSdI0mVBJkiRJ0jSZUEmSJEnSNC0cdgAzaYsttqjtt99+2GFIkgbkRz/60VVV1b8o9kiyDZKkdcdk7c86nVBtv/32LF++fNhhSJIGJMmlw46hK9sgSVp3TNb+2OVPkiRJkqbJhEqSJEmSpsmESpIkSZKmyYRKkiRJkqbJhEqSJEmSpsmESpIkSZKmyYRKkiRJkqbJhEqSJEmSpmmdXth3Ltv+DacNO4R555L3PHvYIUjS0Nn+DIdtkDR3+YRKkrTOS3JgkuVJbk5yXE/59kkqyeqe1yE99UlyWJKr29fhSTKUDyFJGkk+oZIkzQdXAO8Cdgc2Gqd+06q6dZzyA4A9gF2AAk4HLgaOnqE4JUlzjE+oJEnrvKo6paq+CFw9xUP3B46oqsuq6nLgCOAlg45PkjR3mVBJkgSXJrksySeTbNFTvhNwds/22W2ZJEmACZUkaX67CngssB3wGGAT4DM99YuAVT3bq4BFE42jSnJAO1Zr+cqVK2coZEnSKDGhkiTNW1W1uqqWV9WtVfU74EDgGUkWt7usBhb3HLIYWF1VNcH5jq2qpVW1dMmSJTMbvCRpJJhQSZJ0h7FEaewJ1AqaCSnG7NKWSZIEmFBJkuaBJAuTbAgsABYk2bAt2zXJQ5Ksl2Rz4IPAmVU11s3vBOB1SbZJsjVwEHDcUD6EJGkkmVBJkuaDtwA3Am8AXti+fwuwA/CfwPXAz4CbgX17jjsG+DJwblt/WlsmSRLgOlSSpHmgqpYByyaoPnGS4wo4uH1JknQXPqGSJEmSpGkyoZIkSZKkaZr1hCrJPkl+nmRNkouSPKktf1qS85PckOSMJNv1HJMkhyW5un0dPtEaIJIkSZI0W2Y1oUryV8BhwN/RLJ64G3Bxuyr9KcAhwGbAcuCknkMPAPagma52Z+A5wMtnL3JJkiRJuqvZfkL1duAdVfXDqrq9qi6vqsuBPYEVVXVyVd1EM3B4lyQ7tsftDxxRVZe1+x8BvGSWY5ckSZKkO+mUUCVZkmRJz/Yjkrwryb6THdd3jgXAUmBJkguTXJbkqCQbATsBZ4/tW1VrgIvacvrr2/c7MY4kByRZnmT5ypUru4YnSZIkSVPW9QnV54DnArTd874D/A1wdJKDOp5jK2B9YG/gScAjgUfRrAOyCFjVt/8qmm6BjFO/Clg03jiqqjq2qpZW1dIlS5b0V0uSJEnSwHRNqHYGfti+3xu4sKp2Al5M97FMN7Z/fqiqrqyqq4AjgWcBq4HFffsvpllokXHqFwOr2/VBJEmSJGkouiZUG9EkNQBPB77Uvv8xcL8uJ6iqPwCXAeMlQStoJpwAIMnGwAPa8rvUt+9XIEmSJElD1DWh+iWwZ5L7Ac8AvtGWbwVcO4XrfRJ4VZItk9wbeC3wFeBU4OFJ9kqyIfBW4JyqOr897gTgdUm2SbI1cBBw3BSuK0mSJEkD1zWhej
vNdOeXAD+sqrPa8t2Bn0zheu8E/hf4BfDz9thDq2olsBdwKPAHYFdgn57jjgG+DJwL/Aw4rS2TJEmSpKFZ2GWnqjolybbA1tx5tr1vAl/oerGqugV4Zfvqr/smsONdDmrqCji4fUmSJEnSSFjrE6ok6yf5LbBFVf2kqm4fq6uqs3q65UmSJEnSvLLWhKp9qnQL408mIUmSJEnzVtcxVB8C3pikUxdBSZIkSZoPuiZITwKeDFye5GfAmt7KqvrrQQcmSZIkSaOua0J1FVOYfEKSJEmS5oOus/z93UwHIkmSJElzTdcxVAAkWZrk+Uk2brc3dlyVJEmSpPmqUzKUZCvgS8BjaWb7exBwMXAkcBPwmpkKUJIkSZJGVdcnVO8HfgtsDtzQU34y8IxBByVJkiRJc0HX7npPA55WVX9I0lt+EbDtwKOSJEmSpDmg6xOqjYA/jlO+hKbLnyRJkiTNO10Tqu8AL+nZriQLgNcD/zXooCRJGqQkByZZnuTmJMf1lD8uyelJrkmyMsnJSf6sp35ZkluSrO557TCUDyFJGkldE6qDgZclOR3YADgCOA94IvDGGYpNkqRBuQJ4F/CJvvJ7A8cC2wPbAdcDn+zb56SqWtTzunimg5UkzR1d16E6L8kjgFcANwMb0kxI8eGqunIG45Mk6W6rqlOgWf4DuG9P+dd690tyFPDt2Y1OkjSXdV5Dqqp+C7xtBmORJGnYdgNW9JU9N8k1wJXAUVX1kYkOTnIAcADAtts6Z5MkzQdd16HabYKqopmU4qKqumZgUUmSNMuS7Ay8FXheT/HnaLoE/g7YFfhCkmur6sTxzlFVx7b7s3Tp0prZiCVJo6DrE6ozaZIngLF503u3b0/yJeBFVbVmcOFJkjTzkjwQ+Brwmqr67lh5VZ3Xs9t/J/kAsDcwbkIlSZp/uk5K8Wzg58ALgQe2rxfSdIvYq309EnjPDMQoSdKMSbId8E3gnVX1qbXsXtxxY1GSpM5PqN5Fc9eud4r0i5OsBA6rqsckuQ34EPCqQQcpSdLdkWQhTZu3AFiQZEPgVmAr4Fs0kywdPc5xz6NZOuRa4LHAq4E3zVbckqTR1zWhehhw+Tjll7d1AOcC9xlEUJIkDdhbuPPESi8E3k7zxGkH4G1J/lRfVYvat/vQTLW+AXAZzU3E42clYknSnNC1y995wJuTbDBW0L5/U1sHcD/gt5OdJMmZSW7qWRzxgp66pyU5P8kNSc5ou2CM1SXJYUmubl+HJ7HLhSSpk6paVlXpey2rqre373vXmVrUc9y+VbV5W75jVX1wmJ9DkjR6uiZUrwR2By5vk6IzaJ5O7U6zNhU0d/j+rcO5DuxptB4CkGQL4BTgEGAzYDlwUs8xBwB7ALsAOwPPAV7eMXZJkiRJmhFdF/Y9K8n9abpIPIRmQO6JwGfGZvWrqhPuRhx7Aiuq6mSAJMuAq5LsWFXnA/sDR1TVZW39EcDLgLv0d5ckSZKk2TKVhX3XAMcM4JrvTvIe4ALgzVV1JrATcHbvtZJc1Jaf31/fvt9pALFIkiRJ0rR1TqiS3A94ErAlfV0Fq+rIjqd5Pc2Yqz/SDPT9cpJHAouAlX37rgI2ad8vard76xYlSVXdaeFEV6mXJEmSNFs6JVRJ9qOZ5ehWmsSnN4kpoFNCVVVn9Wwen2Rf4FnAamBx3+6Lgevb9/31i4HV/clUew1XqZckSZI0K7pOSvEO4AhgcVVtX1X373ntcDeuP7ZA4gqaCScASLIx8IC2nP769v0KJEmSJGmIuiZUWwEfq6rbpnuhJJsm2T3JhkkWtk+9dgO+DpwKPDzJXu1ii28FzmknpAA4AXhdkm2SbA0cBBw33VgkSZIkaRC6jqH6KrArcPHduNb6wLuAHYHbaCab2KOqLgBIshdwFPBp4CyaMVZjjqGZlv3cdvtjDGaCDEmSJEmatq4J1enAYUl2oklqbumtrKpT1naCqloJPHaS+m/SJFvj1RVwcPuSJEmSpJHQNaEaex
r0pnHqClgwmHAkSZIkae7ourBv17FWkiRJkjRvmChJkiRJ0jR1SqjSeGWSFUluSLJDW/6GJH87syFKkiRJ0mjq+oTqNcBbaBbMTU/55cCBgw5KkiRJkuaCrgnVPwIvq6oPALf2lP8Y2GngUUmSJEnSHNB1lr/tgJ+NU34LsNHgwpEkSZJm3vZvOG3YIcxLl7zn2cMOYeC6PqG6GHj0OOXPAs4bXDiSJEmSNHd0fUL1PuCoJPekGUP1+CQvollo9+9nKjhJkiRJGmVd16H6ZJKFwL8A9wQ+RTMhxaur6qQZjE+SJEmSRlbXJ1RU1UeBjybZAlivqn4/c2FJkiRJ0ujrug7VeknWA6iqq4D1krw0yRNmNDpJkiRJGmFdJ6U4DXgVQJJFwHLgvcC3k7x4hmKTJEmSpJHWNaF6DPCt9v2ewHXAlsDLgH+agbgkSRqYJAcmWZ7k5iTH9dU9Lcn5SW5IckaS7XrqkuSwJFe3r8OT5C4XkCTNW10Tqk2Aa9v3zwBOrapbaJKsB8xEYJIkDdAVwLuAT/QWtuOCTwEOATaj6YHRO9nSAcAewC7AzsBzgJfPQrySpDmia0L1a+CJSTYGdgdOb8s3A26YicAkSRqUqjqlqr4IXN1XtSewoqpOrqqbgGXALkl2bOv3B46oqsuq6nLgCOAlsxS2JGkO6JpQHUkzVfplNNOlf6ct3w04dwbikiRpNuwEnD22UVVrgIva8rvUt+93YgJJDmi7Fi5fuXLlDIQrSRo1nRKqqjoGeDzNIr5/UVW3t1UX0XSTkCRpLloErOorW0XT1X28+lXAoonGUVXVsVW1tKqWLlmyZODBSpJGz1TWoVpO07ccgCTrV9VpMxKVJEmzYzWwuK9sMXD9BPWLgdVVVbMQmyRpDui6DtWrk+zVs/1x4MYkFyR5yIxFJ0nSzFpBM+EEAO1Y4Qe05Xepb9+vQJKkVtcxVK8GVgIk2Q34W+AFwE9pBuhKkjSykixMsiGwAFiQZMMkC4FTgYcn2autfytwTlWd3x56AvC6JNsk2Ro4CDhuCB9BkjSiuiZU2wCXtO+fC5xcVZ+jmQ3pcVO9aJIHJbkpyad7ylwHRJI0U94C3Ai8AXhh+/4tVbUS2As4FPgDsCuwT89xxwBfppmA6Wc0C90fM3thS5JGXdeE6jpgbHTtXwH/1b6/BdhwGtf9MPC/YxuuAyJJmklVtayq0vda1tZ9s6p2rKqNquopVXVJz3FVVQdX1Wbt62DHT0mSenVNqL4BfLQdO/VA4Gtt+U7Ar6ZywST70CwS/F89xa4DIkmSJGnO6ZpQ/T/g+8AWwN5VdU1b/mjgxK4XS7IYeAdNH/ReA1sHxDVAJEmSJM2WTtOmV9V1wKvGKX/bFK/3TuDjVfWbviFQi2gnvejRaR2Q/q4XVXUscCzA0qVL7ZYhSZIkacZ0XodqTJL7APfoLauqX3c47pHA04FHjVPtOiCSJEmS5pxOCVWSewEfpJku/R7j7LKgw2meAmwP/Lp9OrWIZurahwFH04yTGrveROuA/E+77Tog0jpi+ze4Pvhsu+Q9zx52CJIkrTO6jqF6H00SswdwE80aVP8MXAY8v+M5jqVJkh7Zvo6mmX52d1wHRJIkSdIc1LXL3zOBfavqu0luA35UVScluZJm+vLPr+0EVXUDcMPYdpLVwE3tGiAk2Qs4Cvg0cBZ3XQdkB5p1QAA+huuASJIkSRqyrgnVpsCl7ftVwObAhcAPaJKbKRtb/6Nn+5vAjhPsW8DB7UuSJEmSRkLXLn8X0TwhAvg5sE+agVB7AtdMeJQkSZIkrcO6JlTHATu3799D083vj8B7gcMGH5YkSZIkjb6u61C9v+f9t5I8FHgM8MuqOnfiIyVJkiRp3TXldagAqupS7hhTJUmSJEnzUtcufyTZI8l3klzVvr6b5G9mMjhJkiRJGmWdEqokBwEnARdwx2x75wOfTfJPMxeeJEmSJI2url3+/gk4sKo+2lP2iST/A7yDZuFfSZ
IkSZpXunb5WwScMU75GW2dJEmSJM07XROqLwJ7j1O+F/ClwYUjSZIkSXNH1y5/FwJvSPJU4Adt2ePa15FJXje2Y1UdOdgQJUmSJGk0dU2oXgL8AXhw+xrzB+DverYLMKGSJEmSNC90Xdj3/jMdiCRJkiTNNZ3XoZIkSZIk3ZkJlSRJkiRNkwmVJGleS7K673Vbkg+1ddsnqb76Q4YdsyRpdHSdlEKSpHVSVf1pPcUkGwO/A07u223Tqrp1VgOTJM0JEz6hSvKJJJu073dLYvIlSVrX7Q38HvjusAORJM0Nk3X5eyGwcfv+DGCzmQ9HkqSh2h84oaqqr/zSJJcl+WSSLYYRmCRpNE321OkS4FVJvgEEeHySP4y3Y1V9ZwZikyRp1iTZFngy8A89xVcBjwV+CmwOfBj4DLD7BOc4ADgAYNttt53JcCVJI2KyhOqfgY8Cb6RZsPfUCfYrYMGA45Ikaba9GPheVf1qrKCqVgPL283fJTkQuDLJ4qq6rv8EVXUscCzA0qVL+59ySZLWQRN2+auq/6iqLWm6+gXYCVgyzmvLrhdL8ukkVya5Lskvkry0p+5pSc5PckOSM5Js11OXJIclubp9HZ4kU/60kiRN7MXA8WvZZyxJsg2SJAEdZvmrqmuTPBX45QBmOHo38A9VdXOSHYEzk/wEuBQ4BXgp8GXgncBJwOPa4w4A9gB2oWnMTgcuBo6+m/FIkkSSJwDb0De7X5JdgWuBXwL3Bj4InFlVq2Y9SEnSSOo0c19VfTvJBkleDDyMJqk5D/hsVd3c9WJVtaJ3s309AHgMsKKqTgZIsgy4KsmOVXU+zSDhI6rqsrb+COBlmFBJkgZjf+CUqrq+r3wH4F9oemNcR3NDb99Zjk2SNMI6Leyb5GHAL4AjgV1pnhy9H/hFkodO5YJJ/i3JDcD5wJXAV2m6E549tk9VrQEuasvpr2/f78Q4khyQZHmS5StXrpxKaJKkeaqqXl5VLxqn/MSqun9VbVxVf1ZVL66q3w4jRknSaOqUUAEfoJnhaNuqelJVPQnYliax+depXLCqXglsAjyJppvfzcAioL/7xKp2P8apXwUsGm8cVVUdW1VLq2rpkiVLphKaJEmSJE1J18V6nwg8tndGoy8Oq9wAABDwSURBVKq6LsmbgR9O9aJVdRvwvSQvBF4BrAYW9+22GBjretFfvxhYPc46IZIkSZI0a7o+oboJ2HSc8nu1ddO1kGYM1QqaCScASLJxTzn99e373vFYkiRJkjTruiZUXwY+muSJSRa0r78AjgG+1OUESbZMsk+SRe3xu9MM7P0WzRpXD0+yV5INgbcC57QTUgCcALwuyTZJtgYOAo7r/CklSZIkaQZ07fL3Gpq1Ob4L3NaWrUeTTL224zmKpnvf0e2xlwKvrar/AEiyF3AU8GngLGCfnmOPoZlp6dx2+2NtmSRJkiQNTddp068FnpfkgcBDaRY0PK+qLux6oapaCTx5kvpvAjtOUFfAwe1LkiRJkkZC1ydUALQJVOckSpIkSZLWZV3HUEmSJEmS+phQSZIkSdI0mVBJkiRJ0jStNaFKsjDJK9vpyiVJkiRJrbUmVFV1K/BeYP2ZD0eSJEmS5o6uXf5+CDx6JgORJEmSpLmm67TpHwWOSLId8CNgTW9lVf140IFJkiRJ0qjrmlB9tv3zyHHqClgwmHAkSZIkae7omlDdf0ajkCRJkqQ5qFNCVVWXznQgkiRJkjTXdF6HKskzk3wlyXlJ7teWvTTJ02YuPEmSJEkaXZ0SqiT7AZ8DfknT/W9sCvUFwMEzE5okSZIkjbauT6gOBl5WVf8fcGtP+Q+BRw48KkmSJEmaA7omVA8CfjBO+Wpg8eDCkSRJkqS5o2tCdQXw4HHKdwMuGlw4kiRJkjR3dE2ojgU+mOSJ7fb9kuwPHA58ZEYikyRpliQ5M8lNSVa3rwt66p6W5PwkNyQ5o13kXpIkoGNCVVWHA6cApwMbA2cARwNHV9WHZy48SZJmzY
FVtah9PQQgyRY07d8hwGbAcuCkIcYoSRoxXRf2parenORQ4GE0idh5VbV6xiKTJGn49gRWVNXJAEmWAVcl2bGqzh9qZJKkkdB5HapWATcBNwC3DT4cSZKG5t1Jrkry/SRPact2As4e26Gq1tCMHd5pCPFJkkZQ13WoNkjyr8A1NA3LOcA1ST6QZMMpnOPjSS5Ncn2SnyR5Zk/9hH3U0zgsydXt6/AkmdpHlSRpQq8HdgC2oRk3/OUkDwAWAav69l0FbDLeSZIckGR5kuUrV66cyXglSSOi6xOqjwB7Ay+lmUL9ge37vwH+reM5FgK/AZ4M3IumP/rnkmzfoY/6AcAewC7AzsBzgJd3vK4kSZOqqrOq6vqqurmqjge+DzyL8ZcHWQxcP8F5jq2qpVW1dMmSJTMbtCRpJHQdQ/V/gT2r6vSesouT/B74AvD3aztB201iWU/RV5L8CngMsDmT91HfHziiqi5r648AXkYzMYYkSYNWQIAVNG0QAEk2Bh7QlkuS1PkJ1Rrg8nHKLwdunM6Fk2xFs7bVCtbeR/1O9e17+69Lku62JJsm2T3JhkkWJtmPZp3FrwOnAg9Pslfbxf2twDlOSCFJGtM1ofoQ8LYkG40VtO8PaeumJMn6wGeA49tGaW191PvrVwGLxhtHZf91SdIUrQ+8C1gJXAW8Ctijqi6oqpXAXsChwB+AXYF9hhWoJGn0TNjlL8mX+oqeAlye5Jx2+xHt8RtP5YJJ1gM+BfwROLAtXlsf9f76xcDqqqr+81fVsTQDilm6dOld6iVJ6tUmTY+dpP6bwI6zF5EkaS6ZbAzV1X3bX+jb/tVUL9Y+Ufo4sBXwrKq6pa1aWx/1FTQTUvxPu70L9l+XJEmSNGQTJlRV9XczcL2PAA8Fnl5VvWOvTgXem2Qv4DTu2kf9BOB1Sb5KM1D4IKbR1VCSJEmSBmmqC/tOW7uu1MuBRwK/TbK6fe3XoY/6McCXgXOBn9EkXcfMVuySJEmSNJ5O06YnuTfNlOdPBbakLxGrqi3Xdo6qupRmCtqJ6ifso96OlTq4fUmSJEnSSOi6DtUJNNOUHw/8jqbbnSRJkiTNa10TqqcAT66qH89gLJIkSZI0p3QdQ3XRFPaVJEmSpHmha5L0GuDdSXZJsmAmA5IkSZKkuaJrl78LgY2AHwM0y0ndoapMsiRJkiTNO10TqhOBewGvxkkpJEmSJAnonlAtBf68qn42k8FIkiRJ0lzSdQzVecDimQxEkiRJkuaargnVW4Ajkzw9yVZJNut9zWSAkiRJkjSqunb5+2r75ze48/iptNtOSiFJkiRp3umaUD11RqOQJEmSpDmoU0JVVd+e6UAkSZIkaa7plFAlefRk9VX148GEI0mSJElzR9cuf8tpxkr1rujbO5bKMVSSJEmS5p2uCdX9+7bXBx4FvBl440AjkiRJkqQ5ousYqkvHKb4wySrgbcDXBhqVJEmSJM0BXdehmsivgEcOIhBJkiRJmmu6TkrRv3hvgD8DlgEXDDgmSZIkSZoTuo6huoo7T0IBTVL1G+D5A41IkiRJkuaI6S7sezuwEriwqm4dbEiSJM2eJBsA/wY8HdgMuBB4U1V9Lcn2NN3b1/QcclhVvXO245QkjSYX9pUkzXcLaXpcPBn4NfAs4HNJHtGzz6beQJQkjWfSSSmSbNbl1fViSQ5MsjzJzUmO66t7WpLzk9yQ5Iwk2/XUJclhSa5uX4cnyV0uIEnSFFXVmqpaVlWXVNXtVfUVmqdSjxl2bJKk0be2Wf6uounaN9nr91O43hXAu4BP9BYm2QI4BTiEprvFcuCknl0OAPYAdgF2Bp4DvHwK15UkqZMkWwEPBlb0FF+a5LIkn2zbrImOPaC9cbh85cqVMx6rJGn41tblr3/sVK//A7wG6NwFoqpOAUiyFLhvT9WewIqqOrmtXwZclWTHqjof2B84oqoua+uPAF4GHN312pIkrU2S9YHPAMdX1flJFgGPBX4KbA58uK3ffbzjq+
pY4FiApUuX9k/mJElaB02aUI03dirJo4HDgN2AY4BBDMzdCTi757prklzUlp/fX9++32m8EyU5gOaJFttuu+0AQpMkzQdJ1gM+BfwROBCgqlbT9JoA+F2SA4ErkyyuquuGE6kkaZR0Xtg3yf2TfBY4C7gGeFhVvbqqBtGnYRGwqq9sFbDJBPWrgEXjjaOqqmOramlVLV2yZMkAQpMkreva9uTjwFbAXlV1ywS7jj11chyvJAnokFAl2TzJB2ieFN0HeHxVPb+qLhpgHKuBxX1li4HrJ6hfDKyuKrtTSJIG4SPAQ4HnVtWNY4VJdk3ykCTrJdkc+CBwZlX13wSUJM1Ta5vl703ARTRTyT6vqv6yqpZPdsw0raCZcGLsuhsDD+COAcF3qm/f9w4WliRpWtpZZV8OPBL4bZLV7Ws/YAfgP2lu8P0MuBnYd2jBSpJGztompXgXcCNwGfDKJK8cb6eq+usuF0uysL3mAmBBkg1pJrU4FXhvkr2A04C3Aue0E1IAnAC8LslXabpbHAR8qMs1JUmaTFVdyuRd+E6crVgkSXPP2hKqE7ijv/ggvAV4W8/2C4G3V9WyNpk6Cvg0zTitfXr2O4bmLuG57fbH2jJJkiRJGpq1zfL3kkFerKqWAcsmqPsmsOMEdQUc3L4kSZIkaSR0nuVPkiRJknRnJlSSJEmSNE0mVJIkSZI0TSZUkiRJkjRNJlSSJEmSNE0mVJIkSZI0TSZUkiRJkjRNJlSSJEmSNE0mVJIkSZI0TSZUkiRJkjRNJlSSJEmSNE0mVJIkSZI0TSZUkiRJkjRNJlSSJEmSNE0mVJIkSZI0TSZUkiRJkjRNJlSSJEmSNE0mVJIkSZI0TSZUkiRJkjRNJlSSJEmSNE1zJqFKslmSU5OsSXJpkhcMOyZJ0vxgGyRJmsjCYQcwBR8G/ghsBTwSOC3J2VW1YrhhSZLmAdsgSdK45sQTqiQbA3sBh1TV6qr6HvAl4EXDjUyStK6zDZIkTWZOJFTAg4HbquoXPWVnAzsNKR5J0vxhGyRJmtBc6fK3CFjVV7YK2KR/xyQHAAe0m6uTXDDDsenOtgCuGnYQ05HDhh2B5hj/rQ/HdkO4pm3Q3OH/S80X/luffRO2P3MloVoNLO4rWwxc379jVR0LHDsbQemukiyvqqXDjkOaaf5bn1dsg+YI/19qvvDf+miZK13+fgEsTPKgnrJdAAcDS5Jmmm2QJGlCcyKhqqo1wCnAO5JsnOSJwPOATw03MknSus42SJI0mTmRULVeCWwE/B44EXiF09WOJLu6aL7w3/r8Yhs0N/j/UvOF/9ZHSKpq2DFIkiRJ0pw0l55QSZIkSdJIMaGSJEmSpGkyoZIkSZKkaTKhkqQOkmyQ5NAkFydZ1ZY9I8mBw45NkrTusv0ZfSZUGogkD01ySJIPt9s7Jtl52HFJA/R+4OHAfsDYbD4rgFcMLSJJgG2Q1nm2PyPOhEp3W5L/C3wb2AZ4UVu8CDhyaEFJg/c3wAuq6gfA7QBVdTnNv3tJQ2IbpHnA9mfEmVBpEN4BPKOq/hG4rS07G9hleCFJA/dHYGFvQZIlwNXDCUdSyzZI6zrbnxFnQqVB2JKm8YI7HkVXz3tpXXAycHyS+wMk+TPgKODfhxqVJNsgretsf0acCZUG4Ufc0c1izD7A/wwhFmmmvAm4BDgX2BT4JXAF8PYhxiTJNkjrPtufEZcqb+Do7kmyI/AN4FfA44AzgQfTdMH45RBDk2ZE29XiqvILVBo62yDNJ7Y/o8mESgOR5J7Ac4DtgN8AX6mq1cONShqcJDtMVFdVF89mLJLuzDZI6zLbn9FnQqWBa//j31ZVlw47FmlQktxOMyYjPcUFUFULhhKUpLuwDdK6xvZn9DmGSndbkhOTPKF9/3c0ayOcl+QfhhuZNDhVtV5VLWj/XA/YGjiWu47dkDSLbIO0rrP9GX0+odLdluT3wH2r6o9JzgX+EbgW+GJVPWi40UkzJ8kGwC+qarthxyLNV7
ZBmo9sf0bLwrXvIq3VPdqGbBtgs6r6PkCSrYYclzTTHgLcc9hBSPOcbZDmI9ufEWJCpUH4aZI30gwGPg2gbdiuG2pU0gAl+S53XtfmnsBONIuKShoe2yCt02x/Rp8JlQbhH4B3ArcA/9yWPR74zNAikgbvY33ba4CznZZZGjrbIK3rbH9GnGOoJGktkiwAPgEcUFU3DzseSdL8YPszN5hQaVqS/H2X/arqEzMdizQbklwJbFtVtww7Fmm+sw3SfGL7M/pMqDQtSc7osFtV1V/OeDDSLEhyMLAp8DYbNWm4bIM0n9j+jD4TKkmaRJJ9q+rEJL8B7gPcBqykZ4BwVW07rPgkSesm25+5w4RKA5Uk9KzkXVW3DzEc6W5Lcl1VLU7y5In2qapvz2ZMksZnG6R1ie3P3GFCpbutnZ72KGA3mkfSf1JVC4YSlDQgSa6vqk2GHYek8dkGaV1l+zN3OG26BuFo4AbgacC3aRq1ZcBXhxiTNCgLkjyVnrve/arqW7MYj6Q7sw3Susr2Z47wCZXutiRX08w+sybJtVW1aZLNgP+uqh2HHZ90dyS5DbiUiRu0qqodZjEkST1sg7Susv2ZO3xCpUG4Dbi1fX9tkiU0K9RvM7yQpIFZY4MljTTbIK2rbH/miPWGHYDmriT3ad+eBTyrff914CTgFGD5MOKSJK37bIMkjQq7/Gnaemaf2ZQmOf8YsB/wT8Ai4F+r6sphxijdXQ4KlkaTbZDWdbY/c4cJlaat/z96kmuqarNhxiRJmh9sgySNCrv86e4wG5ckDYttkKSR4KQUujsW9k3n2b/tdJ6SpJliGyRpJNjlT9OW5BImv0PodJ6SpBlhGyRpVJhQSZIkSdI0OYZKkiRJkqbJhEqSJEmSpsmESpIkSZKmyYRKkiRJkqbJhEqSJEmSpun/B23eD7xBAH87AAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Inspect the target variable\n",
"train_survived_value_counts = df_train.survived.value_counts()\n",
"test_survived_value_counts = df_test.survived.value_counts()\n",
"\n",
"\n",
"plt.figure(figsize=(12, 4))\n",
"\n",
"plt.subplot(121)\n",
"train_survived_value_counts.plot.bar()\n",
"train_sex_ratio = train_survived_value_counts[True]/train_survived_value_counts[False]\n",
"plt.title(f'Train set: survivied ratio: {train_sex_ratio:.2f}')\n",
"plt.ylabel('Number of passengers')\n",
"\n",
"plt.subplot(122)\n",
"test_survived_value_counts.plot.bar()\n",
"test_sex_ratio = test_survived_value_counts[True]/test_survived_value_counts[False]\n",
"plt.title(f'Test set: surived ratio: {test_sex_ratio:.2f}')\n",
"\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
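{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot titles above report the number of survivors divided by the number of non-survivors, which is an odds-style ratio rather than the fraction of passengers who survived. The sketch below shows how the two quantities relate. The helper `survival_summary` and the counts are hypothetical, picked only to make the arithmetic easy to follow; they are not the values from this notebook.\n",
"\n",
"```python\n",
"def survival_summary(counts):\n",
"    # counts maps True -> number of survivors, False -> number of non-survivors.\n",
"    survived, died = counts[True], counts[False]\n",
"    fraction = survived / (survived + died)  # share of all passengers who survived\n",
"    odds = survived / died                   # survivors per non-survivor\n",
"    return fraction, odds\n",
"\n",
"# Hypothetical counts, chosen only to illustrate the arithmetic.\n",
"train_counts = {True: 400, False: 600}\n",
"test_counts = {True: 100, False: 150}\n",
"\n",
"train_fraction, train_odds = survival_summary(train_counts)\n",
"test_fraction, test_odds = survival_summary(test_counts)\n",
"\n",
"print(f'train: fraction={train_fraction:.2f}, ratio={train_odds:.2f}')\n",
"print(f'test:  fraction={test_fraction:.2f}, ratio={test_odds:.2f}')\n",
"```\n",
"\n",
"Either quantity works for comparing the two sets, as long as the same one is used for both."
]
},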
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next up, let's check whether the ratio of male to female passengers is not too dissimilar between the two sets."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.073343Z",
"start_time": "2020-05-01T17:12:37.733604Z"
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1QAAAEUCAYAAAAspncYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3debhkVXnv8e+PIYA0DQItEZTJCUXFoRUnQMVgNBoJeBUn0AQh8eJwxSgqIiqoqGgUVAYHQIUgCokIDhjn2VaD2AjKKLPN1NDM4Hv/2PtAUZxhn9PndNXp8/08z35611p7eKu6u95ae629dqoKSZIkSdLkrTLoACRJkiRptrJBJUmSJElTZINKkiRJkqbIBpUkSZIkTZENKkmSJEmaIhtUkiRJkjRFNqg0o5J8I8keg45jRUnyzCSXDjiGf0tyVZJlSTZYged9dZIfr6jzjXL+dyT5zKDOL0mDkOTAJF8c4PmT5PNJrkvyyxV87mOSHLQiz9l3/jn1G0djs0Gl+2h/iI8sf01yS8/rV0zmWFX1vKo6dqZi7Zdk8ySVZLUVdc5hkmR14KPATlU1r6quGXRMM2G0hmtVvb+q9pym478vyVlJ7kxy4ATbrpHkiLYRe22SU5NsMpVjSbrHdOai9njfTzIt3xF9xx3oxaQh8Azg74AHVdWTBx3MTBmt4Tpdv3GS/E2SryS5qP0N88wJtl/Wt9yV5LCe+j2TnNfWfTPJxssbo8Zng0r30f4Qn1dV84A/Ay/sKfvSyHZztdEy5DYC1gQWDzqQqWqvdg76u+k84K3AaR22fSPwVOCxwMbA9cBhPfWTOZakVtdcpIHbDLioqm4adCBTNSS/Z34MvBK4cqIN+/5vbATcApwEkGQH4P3Ai4D1gQuBE2YqaDUG/aNFs8hIr0CStyW5Evh8kvsn+XqSJW13/9eTPKhnn7uvCI5cxUvykXbbC5M8b5zzvS3JZUluTHJukh3b8lWS7Jfk/CTXJPlykvXb3X7Y/nl9e2XmqR3e14FJTkryxfZcZyV5eJK3J/lLkkuS7NSz/WuS/KHd9oIke49z7I2TfLX9fC5M8oYxtntKkiuTrNpT9k9JfteuPznJoiQ3tD0hHx3lGA8Hzu15/99ty7dKckbbe3Jukpf07HNMkk+lGbawLMlPkvxtkv9o/47OSfL4nu1HPvcbk5yd5J/Gee9jnneUbb+f5OAkPwFuBrYc63NOsjbwDWDjnqtzG/dfPUzyj0kWJ7m+Pf4jxzp/v6o6tqq+AdzYYfMtgG9V1VVVdSvwn8DWUzyWpAmMlwOSrNl+l1/T/t//VZKNkhwMbAcc3n5nHD7KcUfdt61bN8lnk1zR5qWDkqzafq8cATy1Pe71Hd/D99tj/LTd79QkGyT5Uvs9/6skm/ds//E2F92Q5NdJthvn2E9pj3t9kjMzRm9H+xl+pa/s40k+0a6/uv3uvbHNX/fpFUzyL8Bnet7/e9ryFyT53zaGnyZ5bM8+FyX59yS/S3JT+7lu1OahG5N8J8n9e7Y/KU1+XJrkh0m27o+jZ9sxzzvKtpXk/yb5E/Cnnvd/n885yd8D7wBe2r7PM9vy3t84qyTZP8nFaX47HJdk3bHO36uqbq+q/6iqHwN3ddmnx4uBvwA/al+/EDipqhZX1e3A+4DtkzxkksfVJNig0mT9Lc0Vj82AvWj+DX2+fb0pzVWS+ySqHtvS/OjfEPgQ8Nkk6d8oySOAfYAnVdU6wHOBi9rqNwA7AzvQ9AhcB3yyrdu+/XO99urNz5Js2n65bjpOXC8EvgDcH/gt8K32vW0CvBc4smfbvwAvAOYDrwE+luQJo7yHVYBTgTPb4+wIvCnJc/u3raqfAzcBz+4pfjlwfLv+ceDjVTUfeAjw5VGO8Ufu+SG/XlU9u218nNEe5wHAy4BP9SWklwD70/yd3Ab8DPhN+/orNEMIR5xP86NkXeA9wBeTPHCU997lvP
1eRfNvah3gYsb4nNuroM8DLu+5Snd53/kfTnNF7k3AAuB04NQkf9PWfyrJp8aJZTI+Czw9TaPufsAraBp8kmbGeDlgD5rvpwcDGwD/CtxSVe+k+cG5T/udsc8oxx1137buWOBO4KHA44GdgD2r6g/tdj9rj7seQJKXp70gNo7daL73NqH5Xv8ZTT5dH/gD8O6ebX8FPK6tOx44Kcma/QdMM9z4NOCgdtu3AF9NsmCU858APD/J/HbfVWnywfHtd/gngOe1OfhpwP/2H6CqPtv3/t/d5sPPAXvTfI5HAl9LskbPrrvSDBN8OE3+/QZNg2VDmtzbe/HxG8DDaHLJb4BReyc7nrffzjS/Sx7Vvh71c66qb9L0+pzYvs9tRjnWq9vlWcCWwDx6fg+1DciXjxPLVO0BHFdVNXKqdqHnNcCjZ+Dcatmg0mT9FXh3Vd1WVbdU1TVV9dWqurmqbgQOpklyY7m4qo6uqrtoEtQDabqr+90FrAE8KsnqVXVRVZ3f1u0NvLOqLq2q24ADgRdnjC77qvpzVa1XVX8eJ64fVdW3qupOmm7zBcAHq+oOmh6HzZOs1x7vtKo6vxo/AL5N08jo9yRgQVW9t736dAFwNE0SHc0JNA0PkqwDPJ97uunvAB6aZMOqWtY2wLp4Ac1QjM9X1Z1V9RvgqzRXtEacUlW/bntXTgFurarj2r+jE2l+PNC+95Oq6vKq+mtVnUhzVW+0MfNdztvvmPaK2p1VdcckPufRvBQ4rarOaP8OPwKsRfOjgKp6XVW9ruOxJvJHmuFIlwE3AI+kaYRLmhnj5YA7aH5MP7Sq7mq/227oeNxR903TS/U84E1VdVNV/QX4GGN/l1NVx1fVmL0jrc+333FLaRoN51fVd3ryUO937xfbfHtnVR1Kkx8fMcoxXwmcXlWnt9/TZwCLaPJJf4wX0zRQdm6Lng3c3JNf/go8OslaVXVFVXUdSv5a4Miq+kX7OR5Lc7HuKT3bHNb26l9G09D9RVX9tv37PKXvvX+uqm7s+bveZoyeny7n7feBqrq2qm5pz9X1cx7NK4CPVtUFVbUMeDuw28hvk6p6bFUdP+4RJqm9ULwDze+pEacDL0ny2CRrAQcABdxvOs+te7NBpcla0v7wBiDJ/ZIc2XZx30Az5G699Axd63P32OCqurldnde/UVWdR9O7cCDwlyT/mXtuqtwMOKXtdbqe5kreXYzeMOvqqp71W4Cr2wbFyOu740zyvCQ/TzOU7XqaRLXhKMfcjGZY2vU9sb5jnDiPB3Zpr6btAvymTXgA/0JzJe+cNENBXtDxfW0GbNsXwytoehrHeu/9r+/++0mye89wiutprniN9d4nOm+/S3pfTOJzHs3GNL1cAFTVX9vjbzLmHlP3aZr71jYA1gZOxh4qaSaNlwO+QDPC4D+TXJ7kQ2km6+lirH03A1YHrug555E0PSbLYzLfvfumGQK9tD3/uoz93ft/+r57n0Fz8XI0x9NeyKNnVEQ1IwFeStP7dEWS05Js1fF9bQbs2xfDg2m+l0d0eu9phlV+MM3wzhu4Z6TKWO99ovP26887XT/n0dwr77Trq7F8v00msjvw46q6cKSgqv6Hpnfzq20MF9EMOR/oDMQrOxtUmqzqe70vzdWbbasZjjYy5O4+w/gmfaLmCt8zaL4kCzikrbqEZhjCej3Lmu2Vrv74plXb2PkqTY/HRtUM7zid0d/vJcCFfXGuU1X3uVIIUFVn03z5PY97D/ejqv5UVS+jSeCHAF9ph2RM5BLgB30xzKuqf+v+rhtJNqPpYdsH2KB9779n7Pc+2fPe/XfX4XOe6O/5cpp/NyPHC01ivWyC/aZiG5retWvbK6iHAU9O0jUJS5qcMXNA27v9nqp6FE2P9AtofnTCBN8b4+x7CU1Px4Y955tfVSNDmGc672wHvI1mON792+/DpYz93fuFvs9m7ar64BiHPwl4Zpp7n/+Je+edb1XV39E0xs6h+f7v4hLg4L4Y7ldVU5
kY4eU0kys8h6Zxs3lbPtZ7n+x5e/PORJ/zpPIOzW0Qd3LvxuJ02517904BUFWfrKqHVdUDaHLpajT5WjPEBpWW1zo0V5OuT3NT8Lsn2L6TJI9I8uz2h/Wt7TlGeoyOAA5uf+CTZEGSF7V1S2iGKWw5HXGM4m9ohgAsAe5MM6nGTmNs+0vghjSTa6zVXml7dJInjXP842nGjm9PO2MPQJJXJlnQ9rSM3PTc5cbVrwMPT/KqJKu3y5MyiQkaeqxNk1CWtDG9hrHHZC/veSf6nK8CNhhj2Ac095j9Q5Id2yvM+9L8IPppl5O38a5J8x25Wpqb1cfqdf0VsHuam9ZXB15Hc3/X1VM4lqSJjZkDkjwryWPa/2M30AzjG/muvIpxcsNY+1bVFTRDjg9NMj/N5AMPSTOb2shxH5T2Hs0ZsA7ND/MlNN8hB9DcWzqaLwIvTPLcNuesmWZCqQeNtnFVLQG+T3Pv1oXV3BNGmkki/rG9cHcbsIzukyUcDfxrkm3TWDvJP6QZyj5Z67Tnv4ZmyNr7Z/C8E33OV9EM/x/rt/MJwP9LskWSedxzz9WdXU6e5hEcI/fF/U37dzfmxekkT6MZdXFSX/ma7W+NpBkSeBTNPdjXdYlDU2ODSsvrP2juTbka+DnwzWk67hrAB9vjXknTM/OOtu7jwNeAbye5sT3vtnD3MMKDgZ+0Xf5PSTMpxbKMPylFJ9XcJ/YGmh/s19FcPfvaGNveRXOz7eNopi29mmY2pPFm/TkBeCbw3ZEf5K2/BxYnWUbz/nfrHXo5Qbw70Yz1v5zmszyE5vOdlLYH7VCaG6evAh4D/GQmzjvR51xV59B8Vhe0f88b9+1/Ls29BIfRfO4vpJly+XaANM+NOmKcEI6macS/DHhnu/6qdt/t2r+HEW+hafT/iSYRP5/mSu+Ex5I0JWPmAJphxV+haRD9AfgBTSNjZL8Xp5nB9BOjHHe8fXenudBzNs130le4Zxjdd2keVXFlkpELKa9IMl2Pr/gWzTDiP9KMYriVvqFqI6rqEpoenXfQfB9dAvw74//eO56mB6j3/p5VaC5EXQ5cS3OfTqf7TqtqEc39TIfTfFbn0UzWMBXH0bzny2g++zHvH56G8070OY80XK5J8ptR9v8czbDRH9Lk/FuB149Uppl1drznp51Lkx82aWO5hbbHK82D6/uHku8BnNzmy15r0vxdLqO5sPsz4F3jnFfTIFUz2lMtSZIkSSste6gkSZIkaYpsUEmSJEnSFNmgkiRJkqQpskElSVrpJdknyaIktyU5pqf8Fe2kNSPLzUkqyRPb+gOT3NG3zUzNIipJmoVWWIOqLxktS3JXksN66ndMck6bzL43Mh1qW5ckhyS5pl0+NN5UkpIk9bkcOIhmJq67VdWX2mekzauqeTQzmV0A9M7idWLvNlV1wYoLW5I07FZbUSdqExUA7XMNrqKdgjLNAzBPBvYETgXeB5wIPKXdZS9gZ5oHaBZwBk3CG2/aYzbccMPafPPNp/NtSJIG6Ne//vXVVbVgsvtV1ckASRYCoz6Tp7UHcFxNwxS45iBJWnmMl39WWIOqz4uBvwA/al/vAiyuqpEG1oHA1Um2ap83swdwaFVd2tYfSvOsgXEbVJtvvjmLFi2amXcgSVrhklw8g8fejOah2v/cV/XCJNcCVwCHV9WnxznGXjQXAdl0003NQZK0khgv/wzqHqr+K4BbA2eOVFbVTcD5bfl96tv1rRlFkr3acfKLlixZMu2BS5JWWrsDP6qqC3vKvgw8ElhAcyHvgCQvG+sAVXVUVS2sqoULFky6I02SNAut8AZVkk1pnrh9bE/xPGBp36ZLgXXGqF8KzBvtPiqTmSRpinbn3rmJqjq7qi6vqruq6qfAx2lGWUiSBAymh2p34Md9VwCXAfP7tpsP3DhG/Xxg2XSMcZckKcnTgY2Br0ywaQFOiiRJutugGlTH9pUtpplwArh70oqHtOX3qW/XFyNJUgdJVkuyJrAqsGqSNZ
P03ke8B/DVqrqxb78XJbl/O9vsk4E3AP+94iKXJA27FdqgSvI0YBPa2f16nAI8OsmubcI7APhdOyEFwHHAm5NskmRjYF/gmBUUtiRp9tsfuAXYD3hlu74/QJt3XsJ9L/YB7AacRzNi4jjgkKoabTtJ0hy1omf52wM4uf8KYFUtSbIrcDjwReAXNElsxJHAlsBZ7evPtGWSJE2oqg4EDhyj7lZgvTHqxpyAQpIkWMENqqrae5y67wBbjVFXwFvbRZIkSZKGwqCeQ6UJbL7faYMOYc656IP/MOgQJGngzD+DYQ6SZq9BPYdKkiRJkmY9G1SSJEmSNEU2qCRJkiRpimxQSZIkSdIU2aCSJEmSpCmyQSVJkiRJU2SDSpIkSZKmyAaVJEmSJE2RDSpJkiRJmiIbVJIkSZI0RTaoJEmSJGmKOjWokixIsqDn9WOSHJTkZTMXmiRJkiQNt649VF8GXgiQZEPgh8A/AUck2XeGYpMkSZKkoda1QfVY4Oft+ouB86pqa2B3YO+ZCEySJEmShl3XBtVawLJ2/TnA19r13wAPnu6gJEmSJGk26Nqg+hOwS5IHAzsB327LNwKun4nAJEmSJGnYdW1QvQc4BLgI+HlV/aItfy7w28mcMMluSf6Q5KYk5yfZri3fMck5SW5O8r0km/XskySHJLmmXT6UJJM5ryRp7kqyT5JFSW5LckxP+eZJKsmynuVdPfXmH0nSuFbrslFVnZxkU2Bj4Myequ8AX+16siR/R9MweynwS+CBbfmGwMnAnsCpwPuAE4GntLvuBewMbAMUcAZwAXBE13NLkua0y4GDaC4ErjVK/XpVdeco5eYfSdK4JuyhSrJ6kiuBDavqt1X115G6qvpFVZ0zifO9B3hvVf28qv5aVZdV1WXALsDiqjqpqm4FDgS2SbJVu98ewKFVdWm7/aHAqydxXknSHFZVJ1fVfwHXTHJX848kaVwTNqiq6g7gDporc1OWZFVgIbAgyXlJLk1yeJK1gK3p6fmqqpuA89ty+uvb9a2RJGl6XNzmpc+3oyZGTCr/JNmrHVq4aMmSJTMVqyRpiHS9h+ow4O1JOg0RHMNGwOo0065vBzwOeDywPzAPWNq3/VJgnXa9v34pMG+0cewmM0nSJFwNPAnYDHgiTd75Uk995/wDUFVHVdXCqlq4YMGCGQpZkjRMujaQtgN2AC5L8nvgpt7KqvrHDse4pf3zsKq6AiDJR2kaVD8E5vdtPx+4sV1f1lc/H1hWVffpNauqo4CjABYuXLhcvWqSpJVbVS0DFrUvr0qyD3BFkvlVdQOTyD+SpLmpa4PqaiYx+cRoquq6JJcy+tDBxTTj1AFIsjbwkLZ8pH4bmoksaNcXI0nS9BrJUSM9UOYfSdK4us7y95ppOt/ngdcn+SbNfVlvAr4OnAJ8OMmuwGnAAcDveia8OA54c5LTaZLdvjTDECVJmlA7ZH01YFVg1SRrAnfSDPO7nuZ5i/cHPgF8v6pGhvmZfyRJ4+p6DxUASRYmeWnbg0SStSd5X9X7gF8BfwT+QPMMq4OragmwK3AwcB2wLbBbz35H0kynfhbwe5pG15GTiV2SNKftTzP0fD/gle36/sCWwDdphpj/HrgNeFnPfuYfSdK4OjWGkmwEfI3mxt0CHkbzHI6PArcCb+xynHbGwNe1S3/dd4Ct7rNTU1fAW9tFkqRJqaoDaR7JMZoTxtnP/CNJGlfXHqqPAVcCGwA395SfBOw03UFJkiRJ0mzQdbjejsCO7cQSveXnA5tOe1SSJEmSNAt07aFaC7h9lPIFNEP+JEmSJGnO6dqg+iHw6p7XlWRV4G3A/0x3UJIkSZI0G3Qd8vdW4AdJngSsARwKbA2sCzx9hmKTJEmSpKHWqYeqqs4GHgP8FPg2sCbNhBSPr6rzZy48SZIkSRpenZ8hVVVXAu+ewVgkSZIkaVbp+hyq7ceoKppJKc6vqmunLSpJkiRJmgW69lB9n6bxBDAyb3rv678m+Rrwqqq6af
rCkyRJkqTh1XWWv38A/gC8Enhou7wSWAzs2i6PAz44AzFKkiRJ0lDq2kN1EPDGquqdIv2CJEuAQ6rqiUnuAg4DXj/dQUqSJEnSMOraQ/Uo4LJRyi9r6wDOAv52OoKSJEmSpNmga4PqbOCdSdYYKWjX39HWATwYuHJ6w5MkSZKk4dV1yN/rgFOBy5L8nmZCiscAfwVe0G6zJfCpaY9QkiRJkoZUpwZVVf0iyRY0E1E8gmZmvxOAL43M6ldVx81YlJIkSZI0hCbzYN+bgCNnMBZJkiRJmlU6N6iSPBjYDngAffdeVdVHpzkuSZIkSRp6nRpUSV4BfA64E1jCPQ/1pV23QSVJkiRpzuk6y997gUOB+VW1eVVt0bNs2fVkSb6f5NYky9rl3J66HZOck+TmJN9LsllPXZIckuSadvlQknR+l5KkOS3JPkkWJbktyTE95U9JckaSa5MsSXJSkgf21B+Y5I6evLUsSee8J0la+XVtUG0EfKaq7pqGc+5TVfPa5REASTYETgbeBawPLAJO7NlnL2BnYBvgsTQzC+49DbFIkuaGy2keUv+5vvL7A0cBmwObATcCn+/b5sSevDWvqi6Y6WAlSbNH13uoTge2BWYqiewCLK6qk6C5IghcnWSrqjoH2AM4tKoubesPBV4LHDFD8UiSViJVdTJAkoXAg3rKv9G7XZLDgR+s2OgkSbNZ1wbVGcAhSbYGzgLu6K0cSVQdfSDJB4FzgXdW1feBrYEze453U5Lz2/Jz+uvb9a0ncU5JkrrYHljcV/bCJNcCVwCHV9Wnx9o5yV40oyrYdNNNZyxISdLw6NqgGpku/R2j1BWwasfjvA04G7gd2A04NcnjgHk0k130Wgqs067Pa1/31s1LkqrqnSDDZCZJmpIkjwUOAF7UU/xlmiGBV9GM1Phqkuur6oTRjlFVR7Xbs3DhwhptG0nSyqXTPVRVtco4S9fGFFX1i6q6sapuq6pjgZ8AzweWAfP7Np9PM5adUernA8v6G1PtOY6qqoVVtXDBggVdQ5MkzWFJHgp8A3hjVf1opLyqzq6qy6vqrqr6KfBx4MWDilOSNHy6TkoxUwoIzfCKbUYKk6wNPIR7hl3cq75d7x+SIUnSpLWzyn4HeF9VfWGCzUfyliRJQMcGVTtt+euSLG6nNd+yLd8vyUs6HmO9JM9NsmaS1dpnW20PfAs4BXh0kl2TrEkz5OJ37YQUAMcBb06ySZKNgX2BYyb1TiVJc1abd9akGaK+ak8u2gT4LvDJqrrPREdJXpTk/m0efDLwBuC/V2z0kqRh1rWH6o3A/jTjwnuvzF0G7NPxGKvTTFm7BLgaeD2wc1WdW1VLgF2Bg4HraMap79az75HAqTQTYvweOI177uuSJGki+wO3APsBr2zX9wf2BLYE3t37rKme/XYDzqMZgn4ccEg7ZF2SJKD7pBT/Cry2qk5LclBP+W/oONte22h60jj13wG2GqOugLe2iyRJk1JVBwIHjlH9nnH2e9lMxCNJWnl07aHajKZnqN8dwFrTF44kSZIkzR5dG1QXAE8Ypfz5NNOgS5IkSdKc03XI30eAw5Pcj+YeqqcmeRXNELx/nqngJEmSJGmYdWpQVdXnk6wGvB+4H/AFmgkp3lBVJ85gfJIkSZI0tLr2UFFVRwNHJ9kQWKWq/jJzYUmSJEnS8Ov6HKpVkqwCUFVXA6sk2TPJ02Y0OkmSJEkaYl0npTiN5rlRJJkHLAI+DPwgye4zFJskSZIkDbWuDaon0jxJHmAX4AbgAcBrgbfMQFySJEmSNPS6NqjWAa5v13cCTqmqO2gaWQ+ZicAkSZIkadh1bVD9GXh6krWB5wJntOXrAzfPRGCSJEmSNOy6zvL3UZqp0pcBFwM/bMu3B86agbgkSZIkaeh1fQ7VkUl+DTwYOKOq/tpWnQ+8a6aCkyRJkqRhNpnnUC2imd0PgCSrV9VpMxKVJEmSJM0CXZ9D9YYku/a8/ixwS5JzkzxixqKTJEmSpCHWdV
KKNwBLAJJsD7wEeDnwv8ChMxOaJEmSJA23rkP+NgEuatdfCJxUVV9Ochbwo5kITJIkSZKGXdceqhuABe363wH/067fAaw53UFJkiRJ0mzQtYfq28DRSX4LPBT4Rlu+NXDhTAQmSZIkScOuaw/V/wV+AmwIvLiqrm3LnwCcMNmTJnlYkluTfLGnbMck5yS5Ocn3kmzWU5ckhyS5pl0+lCSTPa8kaW5Ksk+SRUluS3JMX535R5I0ZV2fQ3UD8PpRyt89xfN+EvjVyIskGwInA3sCpwLvA04EntJushewM7ANUMAZwAXAEVM8vyRpbrkcOAh4LrDWSKH5R5K0vLr2UN0tyd8m2bR3meT+uwHXc899WAC7AIur6qSquhU4ENgmyVZt/R7AoVV1aVVdRjOz4KsnG7skaW6qqpOr6r+Aa/qqzD+SpOXS9TlU6yY5NsktwGU09031Lp0kmQ+8F9i3r2pr4MyRF1V1E3B+W36f+nZ9ayRJWj7Tmn+S7NUOLVy0ZMmSGQhXkjRsuvZQfYRmuMPOwK00z6D6d+BS4KWTON/7gM9W1SV95fOApX1lS4F1xqhfCswbbRy7yUySNAnTln8AquqoqlpYVQsXLFgw2iaSpJVM11n+nge8rKp+lOQu4NdVdWKSK4C9ga9MdIAkjwOeAzx+lOplwPy+svnAjWPUzweWVVX1H6iqjgKOAli4cOF96iVJ6jFt+UeSNDd17aFaD7i4XV8KbNCu/wx4WsdjPBPYHPhzkiuBtwC7JvkNsJimBwyAJGsDD2nL6a9v1xcjSdLyMf9IkpZL1wbV+cCW7fofgN3a4Q67ANeOude9HUWTpB7XLkcAp9HMuHQK8OgkuyZZEzgA+F1VndPuexzw5iSbJNmY5h6sYzqeV5I0xyVZrc0vqwKrJlkzyWqYfyRJy6lrg+oY4LHt+gdphvndDnwYOKTLAarq5qq6cmShGUZxa1UtqaolwK7AwcB1wLbAbj27H0kzne1ZwO9pGmJHdoxdkqT9gVuA/YBXtuv7m38kScur63OoPtaz/t0kjwSeCPypqs6ayomr6sC+198Bthpj2wLe2i6SJE1Km3MOHKPO/CNJmrKuk1LcS1VdzD33VEmSJEmzyub7nTboEOakixoho+gAABWHSURBVD74D4MOYdp1frBvkp2T/DDJ1e3yoyT/NJPBSZIkSdIw69RDlWRf4P00N+ce0xY/FTg+ybuq6iMzE56klZ1XCFe8lfHqoCRJg9J1yN9bgH2q6uiess8l+SXwXpoH/0qSJEnSnNJ1yN884HujlH+vrZMkSZKkOadrg+q/gBePUr4r8LXpC0eSJEmSZo+uQ/7OA/ZL8izgZ23ZU9rlo0nePLJhVX10ekOUJEmSpOHUtUH1apoHHj68XUZcB7ym53UBNqgkSZIkzQldH+y7xUwHIkmSJEmzTefnUEmSJEmS7s0GlSRJkiRNkQ0qSZIkSZoiG1SSJEmSNEVjNqiSfC7JOu369km6zggoSZIkSXPCeD1UrwTWbte/B6w/8+FIkiRJ0uwxXq/TRcDrk3wbCPDUJNeNtmFV/XAGYpMkSZKkoTZeg+rfgaOBt9M8sPeUMbYrYNVpjkuSJEmSht6YDaqq+m/gv5OsB1wLbA38ZUUFJkmSJEnDbsJZ/qrqeuBZwJ+q6prRlq4nS/LFJFckuSHJH5Ps2VO3Y5Jzktyc5HtJNuupS5JDklzTLh9Kksm+WUmS+iVZ1rfcleSwtm7zJNVX/65BxyxJGh6dZu6rqh8kWSPJ7sCjaIb5nQ0cX1W3TeJ8HwD+papuS7IV8P0kvwUuBk4G9gROBd4HnAg8pd1vL2BnYJv23GcAFwBHTOLckiTdR1XNG1lPsjZwFXBS32brVdWdKzQwSdKs0Ok5VEkeBfwR+CiwLU1D52PAH5M8suvJqmpxTwOs2uUhwC7A4qo6qapuBQ4EtmkbXQB7AIdW1aVVdRlwKPDqrueVJKmjF9MMb//RoAORJM0OXR/s+3Hgf4
FNq2q7qtoO2BQ4E/iPyZwwyaeS3AycA1wBnE5zf9aZI9tU1U3A+W05/fXt+tZIkjS99gCOq6rqK784yaVJPp9kw7F2TrJXkkVJFi1ZsmRmI5UkDYWuDaqnA++oqhtGCtr1dwLPmMwJq+p1wDrAdjTD/G4D5gFL+zZd2m7HKPVLgXmj3UdlMpMkTUWSTYEdgGN7iq8GngRsBjyRJi99aaxjVNVRVbWwqhYuWLBgJsOVJA2Jrg2qW4H1Rilft62blKq6q6p+DDwI+DdgGTC/b7P5wI3ten/9fGDZKFcQTWaSpKnaHfhxVV04UlBVy6pqUVXdWVVXAfsAOyXpz1mSpDmqa4PqVODoJE9Psmq7PAM4Evjacpx/NZp7qBbTTDgB3H1T8Eg5/fXt+mIkSZo+u3Pv3qnRjFzIc6ZZSRLQvUH1RuBPNDfp3touP6CZqOJNXQ6Q5AFJdksyr22QPRd4GfBdmocGPzrJrknWBA4AfldV57S7Hwe8OckmSTYG9gWO6Ri7JEnjSvI0YBP6ZvdLsm2SRyRZJckGwCeA71dV/zB1SdIc1XXa9OuBFyV5KPBImitzZ1fVeZM4V9EM7zuCpiF3MfCm9gHCJNkVOBz4IvALYLeefY8EtgTOal9/pi2TJGk67AGcXFU39pVvCbwfeABwA81jO162gmOTJA2xTg2qEW0DajKNqN59l9Dc7DtW/XeArcaoK+Ct7SJJ0rSqqr3HKD8BOGEFhyNJmkW6DvmTJEmSJPWxQSVJkiRJU2SDSpIkSZKmaMIGVZLVkryunV1PkiRJktSasEFVVXcCHwZWn/lwJEmSJGn26Drk7+fAE2YyEEmSJEmabbpOm340cGiSzYBfAzf1VlbVb6Y7MEmSJEkadl0bVMe3f350lLoCVp2ecCRJkiRp9ujaoNpiRqOQJEmSpFmoU4Oqqi6e6UAkSZIkabbp/ByqJM9L8vUkZyd5cFu2Z5IdZy48SZIkSRpenRpUSV4BfBn4E83wv5Ep1FcF3jozoUmSJEnScOvaQ/VW4LVV9f+AO3vKfw48btqjkiRJkqRZoGuD6mHAz0YpXwbMn75wJEmSJGn26Nqguhx4+Cjl2wPnT184kiRJkjR7dG1QHQV8IsnT29cPTrIH8CHg0zMSmSRJkiQNua7Tpn8oybrAGcCawPeA24CPVNUnZzA+SZIkSRpaXR/sS1W9M8nBwKNoerbOrqplMxaZJEmSJA25zs+hahVwK3AzcNdkdkyyRpLPJrk4yY1JfpvkeT31OyY5J8nNSb6XZLOeuiQ5JMk17fKhJJlk7JIkjSrJ95PcmmRZu5zbUzdmfpIkqetzqNZI8h/AtcCZwO+Aa5N8PMmaHc+1GnAJsAOwLvAu4MtJNk+yIXByW7Y+sAg4sWffvYCdgW2AxwIvAPbueF5JkrrYp6rmtcsjADrkJ0nSHNd1yN+ngZ2APbln+vSnAh8A1gH+eaIDVNVNwIE9RV9PciHwRGADYHFVnQSQ5EDg6iRbVdU5wB7AoVV1aVt/KPBa4IiO8UuSNBW7MH5+kiTNcV2H/P0f4DVV9aWquqBdvgT8C/DiqZw4yUY0U7EvBram6fkC7m58nd+W01/frm/NKJLslWRRkkVLliyZSmiSpLnpA0muTvKTJM9syybKT/diDpKkuadrg+om4LJRyi8DbpnsSZOsDnwJOLa9wjcPWNq32VKa3i9GqV8KzBvtPqqqOqqqFlbVwgULFkw2NEnS3PQ2YEtgE5pHhZya5CFMnJ/uxRwkSXNP1wbVYcC7k6w1UtCuv6ut6yzJKsAXgNuBfdriZcD8vk3nAzeOUT8fWFZVNZlzS5I0mqr6RVXdWFW3VdWxwE+A5zNxfpIkzXFj3kOV5Gt9Rc8ELkvyu/b1Y9r91+56srZH6bPARsDzq+qOtmoxzX1SI9utDTykLR+p3wb4Zft6m546SZKmWwFh4vwkSZrjxpuU4pq+11/te33hFM73aeCRwHOqqneo4CnAh5PsCpwGHAD8ru
eG3+OANyc5nSbJ7cske8YkSRpNkvWAbYEfAHcCLwW2B95EM7vtePlJkjTHjdmgqqrXTOeJ2ud27A3cBlzZc/vT3lX1pTZZHQ58EfgFsFvP7kfSjG0/q339mbZMkqTltTpwELAVzTMWzwF2rqpzASbIT5KkOa7rtOnLraouphk+MVb9d2iS2Wh1Bby1XSRJmjZVtQR40jj1Y+YnSZI6NaiS3J/mGVLPAh5A32QWVfWAaY9MkiRJkoZc1x6q42ieuXEscBXNfUySJEmSNKd1bVA9E9ihqn4zg7FIkiRJ0qzS9TlU509iW0mSJEmaE7o2kt4IfCDJNklWncmAJEmSJGm26Drk7zxgLeA3AD1TngNQVTayJEmSJM05XRtUJwDrAm/ASSkkSZIkCejeoFoIPLmqfj+TwUiSJEnSbNL1HqqzgfkzGYgkSZIkzTZdG1T7Ax9N8pwkGyVZv3eZyQAlSZIkaVh1HfJ3evvnt7n3/VNpXzsphSRJkqQ5p2uD6lkzGoUkSZIkzUKdGlRV9YOZDkSSJEmSZptODaokTxivvqp+Mz3hSJIkSdLs0XXI3yKae6V6n+jbey+V91BJkiRJmnO6Nqi26Hu9OvB44J3A26c1IkmSJEmaJbreQ3XxKMXnJVkKvBv4xrRGJUmSJEmzQNfnUI3lQuBx0xGIJEmSJM02nRpU/Q/yTbJBkkcDHwDO7XqyJPskWZTktiTH9NXtmOScJDcn+V6SzXrqkuSQJNe0y4eS5D4nkCRpkpKskeSzSS5OcmOS3yZ5Xlu3eZJKsqxnedegY5YkDY+u91Bdzb0noYBmgopLgJdO4nyXAwcBzwXWuvtAyYbAycCewKnA+4ATgae0m+wF7Axs08ZxBnABcMQkzi1J0mhWo8lnOwB/Bp4PfDnJY3q2Wa+q7hxEcJKk4TbVB/v+FVgCnDeZBFNVJwMkWQg8qKdqF2BxVZ3U1h8IXJ1kq6o6B9gDOLSqLm3rDwVeiw0qSdJyqqqbgAN7ir6e5ELgicCvBxKUJGnWGJYH+24NnNlzvpuSnN+Wn9Nf365vPdqBkuxF06PFpptuOlPxSpJWUkk2Ah4OLO4pvjjJyAiJf6+qq8fY1xwkSXPMuPdQjXLv1KjLNMQxD1jaV7YUWGeM+qXAvNHuo6qqo6pqYVUtXLBgwTSEJkmaK5KsDnwJOLYdIXE18CRgM5oeq3Xa+lGZgyRp7pmoh2q0e6f6VYfjTGQZML+vbD5w4xj184FlVTVRbJIkdZJkFeALwO3APgBVtYzm4fYAVyXZB7giyfyqumEwkUqShslEDaH+e6d6/T3wRmA6btJdTHOfFABJ1gYewj3DLRbTTEjxy/b1Ntx7KIYkSVPWjnj4LLAR8PyqumOMTUcu5DnTrCQJmKBBNdq9U0meABwCbA8cSTMjXydJVmvPuSqwapI1aRpkpwAfTrIrcBpwAPC7drgFwHHAm5OcTpPM9gUO63peSZIm8GngkcBzquqWkcIk2wLXA38C7g98Avh+VfUPU5ckzVGdH+ybZIskxwO/AK4FHlVVb6iqJZM43/7ALcB+wCvb9f3bY+wKHAxcB2wL7Naz35E006mfBfyeptF15CTOK0nSqNrnHu5N86D6K3ueN/UKYEvgmzRD0H8P3Aa8bGDBSpKGzoT3PiXZgKbH6F+BnwBPrapF4+81uqo6kHtPTdtb9x1gqzHqCnhru0iSNG2q6mLGH8J3woqKRZI0+0w0y987gPNpHnb4oqp69lQbU5IkSZK0spmoh+ogmmF5lwKvS/K60Taqqn+c7sAkSZIkadhN1KA6jomnTZckSZKkOWmiWf5evYLikCRJkqRZp/Msf5IkSZKke7NBJUmSJElTZINKkiRJkqbIBpUkSZIkTZENKkmSJEmaIhtUkiRJkjRFNqgkSZIkaYpsUEmSJEnSFNmgkiRJkqQpskElSZIkSVNkg0qSJEmSpsgGlSRJkiRNkQ0qSZIkSZoiG1SSJEmSNEWzpkGVZP0kpyS5Kc
nFSV4+6JgkSXODOUiSNJbVBh3AJHwSuB3YCHgccFqSM6tq8WDDkiTNAeYgSdKoZkUPVZK1gV2Bd1XVsqr6MfA14FWDjUyStLIzB0mSxjNbeqgeDtxVVX/sKTsT2KF/wyR7AXu1L5clOXcFxKd7bAhcPeggpiKHDDoCzTL+Wx+MzQZwTnPQ7OH/S80V/ltf8cbMP7OlQTUPWNpXthRYp3/DqjoKOGpFBKX7SrKoqhYOOg5ppvlvfU4xB80S/r/UXOG/9eEyK4b8AcuA+X1l84EbBxCLJGluMQdJksY0WxpUfwRWS/KwnrJtAG8GliTNNHOQJGlMs6JBVVU3AScD702ydpKnAy8CvjDYyDQKh7porvDf+hxhDppV/H+pucJ/60MkVTXoGDpJsj7wOeDvgGuA/arq+MFGJUmaC8xBkqSxzJoGlSRJkiQNm1kx5E+SJEmShpENKkmSJEmaIhtUkiRJkjRFNqgkqYMkayQ5OMkFSZa2ZTsl2WfQsUmSVm7moOFmg0rTIsnqSbZL8tL29dpJ1h50XNI0+hjwaOAVwMhsPouBfxtYRJIAc5DmBHPQEHOWPy23JI8BvgbcBjyoquYleT6wR1W9dLDRSdMjyRXAQ6vqpiTXVtX6bfn1VbXegMOT5ixzkOYCc9Bws4dK0+HTwAFVtRVwR1v2A+AZgwtJmna3A6v1FiRZQPNMIkmDYw7SXGAOGmI2qDQdtga+2K4XQFXdBKw1sIik6XcScGySLQCSPBA4HPjPgUYlyRykucAcNMRsUGk6XAQ8sbcgyZOB8wYSjTQz3kHzb/0sYD3gT8DlwHsGGJMkc5DmBnPQEPMeKi23JC8APgscAewLHAz8K/Daqvr2IGOTZkI7zOLq8gtUGjhzkOYac9DwsUGlaZHkCcCewGbAJcDRVfXrwUYlLZ8kW3bZrqoumOlYJI3NHKSVkTlo9rBBJUljSPJXmnsyMs5mVVWrrqCQJElzhDlo9rBBpSlJ8t4u21XVATMdiyRpbjEHSRomq028iTSqBw86AEnSnGUOkjQ07KGSpA6SrAa8DtgB2JCeIRhVtf2g4pIkrfzMQcPNadM1bZKsk2SLJFuOLIOOSZpGHwP2Bn5IM0XzV4EHAN8dZFCSGuYgreTMQUPMHiottySPAr4EbMM9N0+OPFzRGyW1UkhyGfDUqvpzkuurar0kWwFHVtUOg45PmqvMQZoLzEHDzR4qTYdPAd8D1gduAO4PHAnsMcigpGl2P5rpmAFuSXK/qjoHePwAY5JkDtLcYA4aYvZQabkluQ54QFXd0XPVZG3g91W1xaDjk6ZDkp8Cb6qqXyY5FfgDzY+3V1TVIwcbnTR3mYM0F5iDhps9VJoOtwKrt+tXJ9mU5t/WBoMLSZp2bwTubNffDDwBeCGw18AikgTmIM0N5qAhZg+VlluSLwOnV9UxST4I/CNNgvtzVe082OgkSSszc5CkQbNBpWmVZBXg5cA84LiqunnAIUnTJsnmwGNp/n3fraqOH0Q8ku7NHKSVmTloeNmg0nJLsi7wBpobI/v/k+80kKCkaZbk7cABwGLglp6q8hkg0uCYgzQXmIOG22qDDkArhZOAVYFTuPd/cmllsi/wxKo6e9CBSLoXc5DmAnPQELNBpenwFGCDqrpj0IFIM+ga4KJBByHpPsxBmgvMQUPMWf40HX4MOGWnVnZvAo5KsjDJpr3LoAOT5jhzkOYCc9AQ8x4qLbckDwBOB34BXNVbV1XvHUhQ0jRL8iLgaGDDvqqqqlUHEJIkzEGaG8xBw80hf5oOBwMPpumKnt9TbmtdK5NPAe8A/hPv05CGiTlIc4E5aIjZQ6XlluRG4OFVdcWgY5FmSpKrgI2r6q5BxyLpHuYgzQXmoOHmPVSaDhcA3gysld1HgP2SZNCBSLoXc5DmAnPQELOHSsstyVuAXYDDuO/49e8OJChpmiW5BPhb4Haa2ZbuVlXeFCwNiDlIc4E5aLjZoNJyS3LhGFVVVVuu0GCkGZJkh7
HqquoHKzIWSfcwB2kuMAcNNxtUkiRJkjRF3kMlSR0kWSPJwUkuSLK0LdspyT6Djk2StHIzBw03G1SS1M3HgEcDr+Ce6ZgXA/82sIgkSXOFOWiIOeRPkjpIcgXw0Kq6Kcm1VbV+W359Va034PAkSSsxc9Bws4dKkrq5nb6HoSdZQN9sS5IkzQBz0BCzQSVJ3ZwEHJtkC4AkDwQOp3lqvSRJM8kcNMRsUEnSGPpu9j0SuAg4C1gP+BNwOfDeFR+ZJGllZw6aPbyHSpLGkGRpVa3brt9QVfPb9QXA1eUXqCRphpiDZo/VJt5Ekuas85McSjOT0upJXgNkpDJpVqvqc4MJT5K0EjMHzRL2UEnSGJI8HHgrsBnwLOBHo2xWVfXsFRqYJGmlZw6aPWxQSVIHSf6nqnYcdBySpLnHHDTcbFBJkiRJ0hQ5y58kSZIkTZENKkmSJEmaIhtUkiRJkjRFNqgkSZIkaYpsUEmSJEnSFP1/GKOaYeuMtukAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 864x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Check the sex balance\n",
"train_sex_value_counts = df_train.sex.value_counts()\n",
"test_sex_value_counts = df_test.sex.value_counts()\n",
"\n",
"plt.figure(figsize=(12, 4))\n",
"\n",
"plt.subplot(121)\n",
"train_sex_value_counts.plot.bar()\n",
"train_sex_ratio = train_sex_value_counts['male']/train_sex_value_counts['female']\n",
"plt.title(f'Train set: male vs female ratio: {train_sex_ratio:.2f}')\n",
"plt.ylabel('Number of passengers')\n",
"\n",
"plt.subplot(122)\n",
"test_sex_value_counts.plot.bar()\n",
"test_sex_ratio = test_sex_value_counts['male']/test_sex_value_counts['female']\n",
"plt.title(f'Test set: male vs female ratio: {test_sex_ratio:.2f}')\n",
"\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's check that the relative number of passengers per class is similar between the train and test sets."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.404343Z",
"start_time": "2020-05-01T17:12:38.078737Z"
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1QAAAEUCAYAAAAspncYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3df7xddX3n+9ebhBoliUgTUX7mij+oMISOcbD3juIM9joqVoc4d/BXQcfGtsN0vDKj3F7AqDAV7jDeacECFUGkP5Be8BfaH7Rota3WOBY0NXqliIBiDzTEJPwQ8TN/rO/RzeacnJ2Vc7I557yej8d6sNb3+11rf/faYX/PZ6/P+q5UFZIkSZKk3bfPuDsgSZIkSfOVAZUkSZIk9WRAJUmSJEk9GVBJkiRJUk8GVJIkSZLUkwGVJEmSJPVkQKXHvCSfSnLKuPuhPZPkhUnuGHc/JEmalGRjkqvG3Q/NbwZUmhNJdgwsP0py/8D2a3fnWFX1kqr64Fz1dViSNUkqydK99ZqSpPGYzfGqHe/TSd40B/08NcnnZvu4kvacfzBqTlTV8sn1JN8C3lRVNwy3S7K0qn64N/um2ZNkSVU9PO5+SFJfo45Xmh/8u0Lj4BUq7VWTaV9J3p7kLuDyJE9K8okkE0m2tvVDBvb58a99k7/QJfmvre2tSV6yi9d7e5I7k2xP8vUkJ7TyfZKckeSWJPck+XCSA9puf9H+e2/7hfLnRnhfG5P8YZKr22v9jyRrB+onX2t7kr9L8q8H6p6e5DNJtiW5O8nVrTxJ3pvkH1rdzUmObnWPa+fg20m+l+TiJI8fOsent32/m+QNA6/300k+nuT7Sb6Y5JzBXz2THJnkT5P8Yztn/8dA3RVJfjvJJ5PsBP7FFOfigCSXJ/lO+4w+Ms05m9VzIkmzaVfjRJJlSa5q5fe279IDk5wLPB+4sI0fF05x3Cn3bXVPTHJZ+96+s30/L0nyM8DFwM+149474nv4dJLfSPI37TvzowNjHUmuSXJXq/uLJEcN1L20fTdvb335T618Vbpx+t42Tnw2yT6t7qAk/1+68fzWJL82cLyN7Rxe2Y65Ocm6gfp/muTLre6adOPpOQP1Jyb52/a6f5XkmIG6b6Ub728GdmaKDJMkRw2Mbd9L8uvTnLNZPSdaHPywNQ5PAQ4ADgc20P07vLxtHwbcDzxqEBpwHPB1YBVwPnBZkgw3SvIs4DTguVW1Angx8K1W/WvAK4HjgYOArcBFre4F7b/7V9XyqvrrJIe1L8rDdtGvVwDXtPf2e8BHkuzb6m6hG2SfCLwTuCrJU1vdu4E/AZ4EHAL8Viv/31tfngnsD/xb4J5Wd14rPxZ4OnAwcPZAX57SXutg4N8BFyV5Uqu7CNjZ2pzSlslzth/wp63/TwZeDbxvcEABXgOcC6wApko/+RDwBOCodoz3Tn26Zv2cSNJs2tU4cQrdd9ehwE8DvwzcX1X/N/BZ4LQ2fpw2xXGn3LfVfRD4Id33+s/Sfee9qaq+1tr9dTvu/gBJXtOCiF35ReCN7T38EPjNgbpPAc+g+67+H8DvDtRdBry5jZ9HA3/eyk8H7gBWAwcCvw5UCyA+DtxEN/acALwlyYsHjvkLwB/QfX9/jDbWJ/kp4DrgCrox9PeBwR/Z/inwAeDN7ZxdAnwsyeMGjv1q4GV0Y/cjrlAlWQHcAPxROw9PB/5smvM1a+dkmuNrIaoqF5c5XeiCmBe19RcCPwCW7aL9scDWge1P0w0oAKcC3xyoewLdl9ZTpjjO04F/AF4E7DtU9zXghIHtpwIP0aXBrmnHXLob73Ej8PmB7X2A7wLPn6b93wKvaOtXApcChwy1+ZfAN4DnAfsMlIcuIDpioOzngFsHzvH9g/1v5+F5wJL2Pp81UHcO8Lm2/m+Bzw714xLgHW39CuDKXZyHpwI/Ap40Rd0LgTt2sW/vc+Li4uIyG8vQeLWrceKNwF
8Bx0xxjB+PWdO8xpT70v0h/iDw+IGyVwM3tvVTJ7+rd+P9fBp4z8D2s+nG4CVTtN2/jX1PbNvfpgtgVg61exfwUeDpQ+XHAd8eKvu/gMvb+kbghqG+3N/WXwDcCWSg/nPAOW39t4F3Dx3768DxA5/bG3dxHl4NfHmauo3AVdPU7dE5cVk8i1eoNA4TVfXA5EaSJyS5JMltSb5Pl3K3f5Il0+x/1+RKVd3XVpcPN6qqbwJvofuy/Ickf5DkoFZ9OHBdu+p0L93A+TDdgNbX7QOv/SO6X6sOau/xFwdSFe6l+2VrVWv+Nrog6W9aCsQb2zH+nO7Xu4uA7yW5NMlKul/AngB8aeB4f9TKJ91Tj/yF7j66c7Sa7o+B2wfqBtcPB46bPG479mvprmZN1X7YocA/VtXWXbRhDs6JJM22XY0THwL+GPiDdOnN5w9kJMxkun0PB/YFvjvwmpfQXSnZE4Pf2be111iVLpXwPelSGr/PTzI4Jr+H1wMvBW5Ll4I9mf7+/wDfBP4kyd8nOaOVHw4cNDR+/DqPHFfvGli/D1jW0vMOAu6sqsGrOsNj0+lDxz607TdV+2GH0mVF7NIcnBMtEgZUGofhy+CnA88Cjquqlfwk5e5RaXy7/UJVv1dV/5zuy7joUuWg++J9SVXtP7Asq6o7p+jfqA6dXGmpD4cA30lyOPA7dOmHP11dqsZXae+vqu6qql+qqoPofvl6X5Knt7rfrKrn0KXPPRP4z8DddFegjhro+xNr4MbqXZigS/k4ZKDs0IH124HPDJ2X5VX1KwNtdnV+bgcOSLL/rjoxB+dEkmbbtONEVT1UVe+sqmcD/ytwIl1qHcwwhuxi39vprlCtGni9lVU1mXK9x2MTXVr9Q3TjyGvoUtVfRJeCuKa1mfwe/mJVvYIuoPsI8OFWvr2qTq+qpwEvB96a7v7k2+kyJQbP14qqeukIffwucPBQ+v7w2HTu0LGfUFW/P9BmprHpiBH6MdvnRIuEAZUeC1bQBQj3prtZ9h2zcdAkz0ryL1uO9QPtNSZnpLsYOLf9YU+S1Ule0eom6NLWnrabL/mcJCe1X9veQjcwfh7Yj+6LfqK91hvorsZM9vPf5CeTcGxtbR9O8twkx7VfLne29/Bwu/r1O8B7kzy5HePgoTz1KVU3I9+1wMZ2ZfBIfvJHAMAngGcmeX2Sfdvy3HQ3RM+oqr5Ll3/+vnSTjeyb5AVTNJ3VczJK3yRpN007TiT5F0n+Scuk+D5dkDL5XfQ9djF+TLdv+/78E+CCJCvTTYpxRJLjB457SLr7jXbH65I8O8kT6FLT/rCNBSvoxql76LIe/stAH38qyWuTPLGqHmr9fLjVnZhu4qAMlD8M/A3w/XSTQzy+Xe05OslzR+jjX7djnJZkaTvP/2yg/neAX27f/0myX5KXpbs3ahSfAJ6S5C3pJnVakeS4KdrN9jnRImFApceC/xd4PN0vZp+nS1+bDY8D3tOOexfdL0qTs/r8d7obYv8kyfb2usfBj9MIzwX+sqUWPC/dpBQ7sutJKT5Kdw/SVuD1wEntl8i/Ay6gGzC+B/wT4C8H9nsu8IUkO1qf/mNV3QqspBtEttKladwD/Ne2z9vp0gs+39ISbqC7yjeK0+h+ebuLLvXk9+kGEKpqO91N0CcD32ltzqM7l6N6Pd0fCFvo7t16y3CDOTonkjSbph0n6NKg/5Duj+evAZ8BrhrY71XpZjn9TR5tV/v+IvBTwN/Rfc/9Id29W9BNgLAZuCvJ3QDtD/zNM7yPD9Hd/3oXsIxusg3o7lW9je7epb9r72/Q64FvtTHml4HXtfJn0I05O+i+w99XVZ9uQdrL6e6DvpVu7H0/3XizS1X1A+AkukmU7m2v9Ql+MjZtAn6JLuV7K934d+pMxx04/nbg51v/7gL+f6aYpZZZPiej9k/zXx6ZriqpjyQb6W5Gfd1MbR9rkpxHN6nHKTM2liTNG0k+TTfhwvvH3ZfdleQLwMVVdfm4+yLNxC
tU0iKT7jlTx7S0iX9G94vgdePulyRp8UpyfJKntJS/U4BjmL2MFWlOPerBZ5IWvBV0aX4H0aXkXUCXrihJ0rg8i26Sh+V0M/K9qt1XJj3mjZzyl+RkuskCDqPLPz21qj7bZjG5qJV/oZXf1vYJ3T0sb2qHuQx4e5lnKEmSJGkBGCnlL8nP092Y/ga6X7dfAPx9klV0M4adRfdk603A1QO7bqB7yvhauku3J9JNgSxJkiRJ896o91C9E3hXVX2+qn7UnsFwJ92MLJur6pr2oNaNwNo2FTPAKcAFVXVHa38BuzEriyRJM0lyWpJNSR5McsU0bd6RpJK8aKAsSc5Lck9bzm+ZFZIkjWzGe6jacxLWAR9L8k26KTc/QvcwzaOAmybbVtXOJLe08i3D9W39KGawatWqWrNmzejvQpI0733pS1+6u6pW99j1O8A5wIvpHsHwCEmOAF5F9/DQQYNZFAX8KfD3dM8f2iXHKUlaXHY1Ro0yKcWBwL50g9Hz6Z4v81HgTLobByeG2m+jSwuk1W8bqlueJMP3USXZQDe4cdhhh7Fp06YRuiZJWiiS3NZnv6q6tu2/DjhkiiYX0j277X1D5T/Oomj7X0D3rJsZA6o1a9Y4TknSIrKrMWqUlL/7239/q6q+W1V3A/8NeCndA8xWDrVfCWxv68P1K4EdU01KUVWXVtW6qlq3enWfHyglSXqkJP8G+EFVfXKK6l5ZFJIkDZoxoKqqrcAddOkQwzbTpUoAkGQ/4IhW/qj6tj7TE70lSdpjSZYD/wV4yzRNps2imOZ4G9q9WpsmJoaTMyRJi9Wok1JcDvyHJE9O8iS6wekTdA8DPTrJ+iTLgLOBm6tqS9vvSuCtSQ5OchBwOnDFrL4DSZKm9k7gQ1V16zT1I2dRgJkUkqSpjRpQvRv4IvAN4GvAl4Fzq2oCWA+cC2wFjgNOHtjvEuDjwFeArwLXtzJJkubaCcCvJbkryV3AocCHk7y91ZtFIUnaY6NMSkFVPQT8aluG624AjnzUTl1dAW9riyRJsy7JUrrxbAmwpGVM/JAuoNp3oOkXgbcCn2rbk1kUn6RLaz8d+K291W9J0sIwUkAlSdJj2JnAOwa2Xwe8s6o2DjZK8jCwtap2tKJLgKfRZVEAvB+zKCRJu8mASpI0r7XAaeMI7dYMbZtFIUnaY6PeQyVJkiRJGmJAJUmSJEk9mfI3hTVnXD/uLozVt97zsnF3QZI0DccoxyhJjy1eoZIkSZKkngyoJEmSJKknAypJkiRJ6smASpIkSZJ6MqCSJEmSpJ4MqCRJkiSpJwMqSZIkSerJgEqSJEmSejKgkiRJkqSeDKgkSZIkqScDKkmSJEnqyYBKkiRJknoyoJIkSZKkngyoJEmSJKknAypJkiRJ6smASpIkSZJ6MqCSJEmSpJ4MqCRJkiSpJwMqSZIkSerJgEqSJEmSejKgkiTNa0lOS7IpyYNJrhgof16SP03yj0kmklyT5KkD9UlyXpJ72nJ+kozlTUiS5i0DKknSfPcd4BzgA0PlTwIuBdYAhwPbgcsH6jcArwTWAscAJwJvnuO+SpIWmKXj7oAkSXuiqq4FSLIOOGSg/FOD7ZJcCHxmoOgU4IKquqPVXwD8EnDxXPdZkrRweIVKkrRYvADYPLB9FHDTwPZNrWxKSTa01MJNExMTc9RFSdJ8M1JAleTTSR5IsqMtXx+oOyHJliT3JbkxyeEDdeanS5LGLskxwNnAfx4oXg5sG9jeBiyfbpyqqkural1VrVu9evXcdVaSNK/szhWq06pqeVueBZBkFXAtcBZwALAJuHpgH/PTJUljleTpwKeA/1hVnx2o2gGsHNheCeyoqtqb/ZMkzW97mvJ3ErC5qq6pqgeAjcDaJEe2+h/np1fVncAFwKl7+JqSJI2kZU3cALy7qj40VL2Z7ge/SWt5ZEqgJEkz2p2A6jeS3J3kL5O8sJU9Iv+8qnYCt/CTHPSR89PNTZck9ZFkaZJlwBJgSZJlrexg4M
+Bi6pqqokmrgTemuTgJAcBpwNX7LWOS5IWhFEDqrcDTwMOppuC9uNJjuDR+ee07RVtfeT8dHPTJUk9nQncD5wBvK6tnwm8iW7sesfAPcA7Bva7BPg48BXgq8D1rUySpJGNNG16VX1hYPODSV4NvJRH55/Ttre3dfPTJUlzqqo20qWcT+Wdu9ivgLe1RZKkXvreQ1VAGMo/T7IfcAQ/yUE3P12SJEnSgjVjQJVk/yQvHshJfy3dszz+GLgOODrJ+pa/fjZwc1Vtabubny5JkiRpwRol5W9f4BzgSOBhYAvwyqr6OkCS9cCFwFXAF4CTB/a9hC5//Stt+/2Yny5JkiRpgZgxoKqqCeC5u6i/gS7YmqrO/HRJkiRJC9aePodKkiRJkhYtAypJkiRJ6smASpIkSZJ6MqCSJEmSpJ4MqCRJkiSpJwMqSZIkSerJgEqSJEmSejKgkiRJkqSeDKgkSZIkqScDKkmSJEnqyYBKkiRJknoyoJIkSZKkngyoJEmSJKknAypJkiRJ6smASpIkSZJ6MqCSJEmSpJ4MqCRJkiSpJwMqSZIkSerJgEqSNK8lOS3JpiQPJrliqO6EJFuS3JfkxiSHD9QlyXlJ7mnL+Umy19+AJGleM6CSJM133wHOAT4wWJhkFXAtcBZwALAJuHqgyQbglcBa4BjgRODNe6G/kqQFxIBKkjSvVdW1VfUR4J6hqpOAzVV1TVU9AGwE1iY5stWfAlxQVXdU1Z3ABcCpe6nbkqQFwoBKkrRQHQXcNLlRVTuBW1r5o+rb+lFIkrQbDKgkSQvVcmDbUNk2YMU09duA5dPdR5VkQ7tXa9PExMSsd1aSND8ZUEmSFqodwMqhspXA9mnqVwI7qqqmOlhVXVpV66pq3erVq2e9s5Kk+cmASpK0UG2mm3ACgCT7AUe08kfVt/XNSJK0GwyoJEnzWpKlSZYBS4AlSZYlWQpcBxydZH2rPxu4uaq2tF2vBN6a5OAkBwGnA1eM4S1IkuYxAypJ0nx3JnA/cAbwurZ+ZlVNAOuBc4GtwHHAyQP7XQJ8HPgK8FXg+lYmSdLIlo67A5Ik7Ymq2kg3JfpUdTcAR05TV8Db2iJJUi+7dYUqyTOSPJDkqoEyn0IvSZIkaVHa3ZS/i4AvTm74FHpJkiRJi9nIAVWSk4F7gT8bKPYp9JIkSZIWrZECqiQrgXfRzYA0yKfQS5IkSVq0Rr1C9W7gsqq6fah81p5C7xPoJUmSJM03MwZUSY4FXgS8d4rqWXsKvU+glyRJkjTfjDJt+guBNcC324Wl5XQPTnw2cDHdfVLALp9C/zdt26fQS5IkSVowRkn5u5QuSDq2LRfTPfzwxfgUekmSJEmL2IxXqKrqPuC+ye0kO4AH2hPoSbIeuBC4CvgCj34K/dPonkIP8H58Cr0kSZKkBWKUlL9HaE+kH9z2KfSSJEmSFqXdfbCvJEmSJKkxoJIkSZKkngyoJEmSJKknAypJkiRJ6smASpIkSZJ6MqCSJEmSpJ4MqCRJkiSpJwMqSZIkSerJgEqSJEmSejKgkiRJkqSeDKgkSZIkqScDKkmSJEnqyYBKkiRJknoyoJIkLWhJ1iT5ZJKtSe5KcmGSpa3uhCRbktyX5MYkh4+7v5Kk+cWASpK00L0P+AfgqcCxwPHAryZZBVwLnAUcAGwCrh5XJyVJ85MBlSRpoftfgA9X1QNVdRfwR8BRwEnA5qq6pqoeADYCa5McOb6uSpLmGwMqSdJC99+Bk5M8IcnBwEv4SVB102SjqtoJ3NLKJUkaiQGVJGmh+wxdkPR94A661L6PAMuBbUNttwErpjpIkg1JNiXZNDExMYfdlSTNJwZUkqQFK8k+wB/T3Su1H7AKeBJwHrADWDm0y0pg+1THqqpLq2pdVa1bvXr13HVakjSvGFBJkhayA4BDgQur6sGquge4HHgpsBlYO9kwyX7AEa1ckqSRGFBJkhasqrobuBX4lSRLk+wPnEJ379R1wNFJ1idZBpwN3FxVW8
bXY0nSfGNAJUla6E4C/hUwAXwT+CHwf1bVBLAeOBfYChwHnDyuTkqS5qel4+6AJElzqar+FnjhNHU3AE6TLknqzStUkiRJktSTAZUkSZIk9WRAJUmSJEk9GVBJkiRJUk8GVJIkSZLU00gBVZKrknw3yfeTfCPJmwbqTkiyJcl9SW5McvhAXZKcl+SetpyfJHPxRiRJkiRpbxv1CtVvAGuqaiXwC8A5SZ6TZBVwLXAW3dPoNwFXD+y3AXgl3ZPojwFOBN48S32XJEmSpLEaKaCqqs1V9eDkZluOoHtY4uaquqaqHgA2AmuTTD7T4xTggqq6o6ruBC4ATp3F/kuSJEnS2Iz8YN8k76MLhh4PfBn4JN3T5W+abFNVO5PcAhwFbGn/vWngMDe1MkmSJO2mNWdcP+4ujNW33vOycXdBepSRJ6Woql8FVgDPp0vzexBYDmwbarqttWOK+m3A8qnuo0qyIcmmJJsmJiZGfweSJEmSNCa7NctfVT1cVZ8DDgF+BdgBrBxqthLY3taH61cCO6qqpjj2pVW1rqrWrV69ene6JUmSJElj0Xfa9KV091BtpptwAoAk+w2UM1zf1jcjSZIkSQvAjAFVkicnOTnJ8iRLkrwYeDXw58B1wNFJ1idZBpwN3FxVW9ruVwJvTXJwkoOA04Er5uSdSJIkSdJeNsqkFEWX3ncxXQB2G/CWqvooQJL1wIXAVcAXgJMH9r0EeBrwlbb9/lYmSZIkSfPejAFVVU0Ax++i/gbgyGnqCnhbWyRJkiRpQel7D5UkSZIkLXoGVJIkSZLUkwGVJEmSJPVkQCVJkiRJPRlQSZIkSVJPBlSSJEmS1JMBlSRJkiT1ZEAlSZIkST0ZUEmSFrwkJyf5WpKdSW5J8vxWfkKSLUnuS3JjksPH3VdJ0vxiQCVJWtCS/DxwHvAGYAXwAuDvk6wCrgXOAg4ANgFXj6ufkqT5aem4OyBJ0hx7J/Cuqvp8274TIMkGYHNVXdO2NwJ3JzmyqraMpaeSpHnHK1SSpAUryRJgHbA6yTeT3JHkwiSPB44CbppsW1U7gVta+VTH2pBkU5JNExMTe6P7kqR5wCtU0pA1Z1w/7i6Mzbfe87Jxd0GabQcC+wKvAp4PPAR8FDgTWA4MR0bb6NICH6WqLgUuBVi3bl3NUX8lSfOMV6gkSQvZ/e2/v1VV362qu4H/BrwU2AGsHGq/Eti+F/snSZrnDKgkSQtWVW0F7gCmuqK0GVg7uZFkP+CIVi5J0kgMqCRJC93lwH9I8uQkTwLeAnwCuA44Osn6JMuAs4GbnZBCkrQ7DKgkSQvdu4EvAt8AvgZ8GTi3qiaA9cC5wFbgOODkcXVSkjQ/OSmFJGlBq6qHgF9ty3DdDcCRe71TkqQFw4BKkiRJmgecifixyZQ/SZIkSerJgEqSJEmSejKgkiRJkqSeDKgkSZIkqScDKkmSJEnqyVn+JKlZzLMnwWN7BiVJkh6rvEIlSZIkST0ZUEmSJElSTwZUkiRJktSTAZUkSZIk9TRjQJXkcUkuS3Jbku1JvpzkJQP1JyTZkuS+JDcmOXygLknOS3JPW85Pkrl6M5IkSZK0N41yhWopcDtwPPBE4Czgw0nWJFkFXNvKDgA2AVcP7LsBeCWwFjgGOBF486z1XpIkSZLGaMZp06tqJ7BxoOgTSW4FngP8NLC5qq4BSLIRuDvJkVW1BTgFuKCq7mj1FwC/BFw8m29CkiRJksZht++hSnIg8ExgM3AUcNNkXQu+bmnlDNe39aOYQpINSTYl2TQxMbG73ZIkSZKkvW63Aqok+wK/C3ywXYFaDmwbarYNWNHWh+u3Acunuo+qqi6tqnVVtW716tW70y1JkiRJGouRA6ok+wAfAn4AnNaKdwArh5quBLZPU78S2FFV1au3kiRJkvQYMlJA1a4oXQYcCKyvqoda1Wa6CScm2+0HHNHKH1Xf1jcjSZIkSQvAqFeofhv4GeDlVXX/QPl1wNFJ1idZBpwN3NzSAQGuBN
6a5OAkBwGnA1fMTtclSZIkabxGeQ7V4XRTnR8L3JVkR1teW1UTwHrgXGArcBxw8sDulwAfB74CfBW4vpVJkiRJ0rw3yrTptwHTPoy3qm4AjpymroC3tUWSJEmSFpTdnjZdkiRJktQxoJIkLQpJnpHkgSRXDZSdkGRLkvuS3NjS3CVJGpkBlSRpsbgI+OLkRpJVwLXAWcABwCbg6vF0TZI0XxlQSZIWvCQnA/cCfzZQfBKwuaquqaoHgI3A2iRT3hcsSdJUDKgkSQtakpXAu+ge3THoKOCmyY2q2gnc0solSRqJAZUkaaF7N3BZVd0+VL4c2DZUtg1YMdVBkmxIsinJpomJiTnopiRpPjKgkiQtWEmOBV4EvHeK6h3AyqGylcD2qY5VVZdW1bqqWrd69erZ7agkad6a8TlUkiTNYy8E1gDfTgLdVaklSZ4NXAycMtkwyX7AEcDmvd5LSdK85RUqSdJCdildkHRsWy4GrgdeDFwHHJ1kfZJlwNnAzVW1ZVydlSTNP16hkiQtWFV1H3Df5HaSHcADVTXRttcDFwJXAV8ATh5HPyVJ85cBlSRp0aiqjUPbNwBOky5J6s2UP0mSJEnqyYBKkiRJknoyoJIkSZKkngyoJEmSJKknAypJkiRJ6smASpIkSZJ6MqCSJEmSpJ4MqCRJkiSpJwMqSZIkSerJgEqSJEmSejKgkiRJkqSeDKgkSZIkqScDKkmSJEnqyYBKkiRJknoyoJIkSZKkngyoJEmSJKknAypJkiRJ6mmkgCrJaUk2JXkwyRVDdSck2ZLkviQ3Jjl8oC5JzktyT1vOT5JZfg+SJEmSNBajXqH6DnAO8IHBwiSrgGuBs4ADgE3A1QNNNgCvBNYCxwAnAm/esy5LkiRJ0mPDSAFVVV1bVR8B7hmqOgnYXFXXVNUDwEZgbZIjW/0pwAVVdUdV3QlcAJw6Kz2XJEmSpDHb03uojgJumtyoqp3ALa38UfVt/SgkSZIkaQHY04BqObBtqGwbsGKa+m3A8qnuo0qyod2ntWliYmIPuyVJkiRJc29PA6odwMqhspXA9mnqVwI7qqqGD1RVl1bVuqpat3r16j3sliRJkiTNvT0NqDbTTTgBQJL9gCNa+aPq2/pmJEnaC3UuYUMAAAWtSURBVJI8LsllSW5Lsj3Jl5O8ZKB+2plqJUkaxajTpi9NsgxYAixJsizJUuA64Ogk61v92cDNVbWl7Xol8NYkByc5CDgduGLW34UkSVNbCtwOHA88kW5W2g8nWTPCTLWSJM1o1CtUZwL3A2cAr2vrZ1bVBLAeOBfYChwHnDyw3yXAx4GvAF8Frm9lkiTNuaraWVUbq+pbVfWjqvoEcCvwHGaeqVaSpBktHaVRVW2kG2imqrsBmHLwafdKva0tkiSNVZIDgWfSpZ//CkMz1SaZnKl2y9RHkCTpkfb0HipJkuaFJPsCvwt8sKWmzzRT7fD+zkYrSXoUAypJ0oKXZB/gQ8APgNNa8Uwz1T6Cs9FKkqZiQCVJWtDasw8vAw4E1lfVQ61qpplqJUmakQGVJGmh+23gZ4CXV9X9A+UzzVQrSdKMDKgkSQtWe67Um4FjgbuS7GjLa0eYqVaSpBmNNMufJEnzUVXdBmQX9dPOVCtJ0ii8QiVJkiRJPRlQSZIkSVJPBlSSJEmS1JMBlSRJkiT1ZEAlSZIkST0ZUEmSJElSTwZUkiRJktSTAZUkSZIk9WRAJUmSJEk9GVBJkiRJUk8GVJIkSZLUkwGVJEmSJPVkQCVJkiRJPRlQSZIkSVJPBlSSJEmS1JMBlSRJkiT1ZEAlSZIkST0ZUEmSJElSTwZUkiRJktSTAZUkSZIk9WRAJUmSJEk9GVBJkiRJUk8GVJIkSZLU05wHVEkOSHJdkp1Jbkvymrl+TUmSRuU4JUnaE0v3wmtcBPwAOBA4Frg+yU1VtXkvvLYkSTNxnJIk9TanV6iS7AesB86qqh1V9TngY8Dr5/J1JUkaheOUJGlPparm7uDJzw
J/VVWPHyj7T8DxVfXyobYbgA1t81nA1+esY499q4C7x90JjYWf/eK22D//w6tq9d58QcepXhb7v9PFzs9/8Vrsn/20Y9Rcp/wtB7YNlW0DVgw3rKpLgUvnuD/zQpJNVbVu3P3Q3udnv7j5+Y+F49Ru8t/p4ubnv3j52U9vriel2AGsHCpbCWyf49eVJGkUjlOSpD0y1wHVN4ClSZ4xULYW8EZfSdJjgeOUJGmPzGlAVVU7gWuBdyXZL8n/BrwC+NBcvu4CsOhTShYxP/vFzc9/L3Oc6sV/p4ubn//i5Wc/jTmdlAK653sAHwB+HrgHOKOqfm9OX1SSpBE5TkmS9sScB1SSJEmStFDN9T1UkiRJkrRgGVBJkiRJUk8GVGOW5GeTvCrJE5IsSXJakvcmOXHcfZM0d5IcluRfJ3nmFHWvHkefpKk4TkmLk+PU6AyoxijJvwM+Cfwm8BfA24Gj6B40+ftJ3jjG7mmM2h8tZ4+7H5obSf4V8FVgI/C3Sd6XZMlAk0vG0jFpiOOUpuM4tbA5Tu0eJ6UYoyRbgF8AAnwN+OdV9Vet7sXA+VW1doxd1JgkeRxwX1UtmbGx5p0kXwLOrqrrkxwIXAU8CJxUVT9Isr2qVoy3l5LjlKbnOLWwOU7tHgOqMUqyraqe2NZ3AsurfSBJ9gH+sar2H2cfNXeSfGAX1UuB1zpQLUyD/++37aV0g9Uquj9ev+dApccCx6nFzXFq8XKc2j2m/I3XziT7tvUr6pHR7eOBH42hT9p7XgPcD9w5xXLHGPulubc1yaGTG1X1Q+DVwLeBGwD/QNFjhePU4uY4tXg5Tu2GpePuwCL3Z8DTga9V1b8fqjsRuHnvd0l70VeAP66qjw1XJFkGnLH3u6S95AbgDcC7JgvaH6pvTHIx8LxxdUwa4ji1uDlOLV6OU7vBlL/HqCSr6f7t3j3uvmhuJPn3wJ1V9ZEp6pYAZ1bVO/d+zzTXkvwUsLSq7pum/rCq+vZe7pa0WxynFj7HqcXLcWr3GFBJkiRJUk/eQyVJkiRJPRlQSZIkSVJPBlSSJEmS1JMBlSRJkiT1ZEAlSZIkST39T3dYNG7egGcIAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 864x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Check the class balance\n",
"train_pclass_value_counts = df_train.pclass.value_counts()\n",
"test_pclass_value_counts = df_test.pclass.value_counts()\n",
"\n",
"plt.figure(figsize=(12, 4))\n",
"\n",
"plt.subplot(121)\n",
"plt.title('Train set: passenger class')\n",
"train_pclass_value_counts.plot.bar()\n",
"\n",
"plt.subplot(122)\n",
"plt.title('Test set: passenger class')\n",
"test_pclass_value_counts.plot.bar()\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the above diagnostics, we are satisfied that, at least in these few categories, the train and test sets are similar enough, and we can move forward.\n",
"\n",
"## Feature engineering\n",
"\n",
"In this section we will use `vaex` to create meaningful features that will be used to train a classification model. To start with, let's get a high-level overview of the training data."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.527108Z",
"start_time": "2020-05-01T17:12:38.408602Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" pclass survived name sex age \\\n",
"dtype int64 bool str str float64 \n",
"count 1047 1047 1047 1047 841 \n",
"NA 0 0 0 0 206 \n",
"mean 2.3075453677172875 0.3744030563514804 -- -- 29.565299286563608 \n",
"std 0.833269 0.483968 -- -- 14.162 \n",
"min 1 False -- -- 0.1667 \n",
"max 3 True -- -- 80 \n",
"\n",
" sibsp parch ticket fare \\\n",
"dtype int64 int64 str float64 \n",
"count 1047 1047 1047 1046 \n",
"NA 0 0 0 1 \n",
"mean 0.5100286532951289 0.3982808022922636 -- 32.926091013384294 \n",
"std 1.07131 0.890852 -- 50.6783 \n",
"min 0 0 -- 0 \n",
"max 8 9 -- 512.329 \n",
"\n",
" cabin embarked boat body home_dest \n",
"dtype str str str float64 str \n",
"count 233 1046 380 102 592 \n",
"NA 814 1 667 945 455 \n",
"mean -- -- -- 159.6764705882353 -- \n",
"std -- -- -- 96.2208 -- \n",
"min -- -- -- 1 -- \n",
"max -- -- -- 327 -- "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Imputing\n",
"\n",
"We notice that several columns have missing data, so our first task will be to impute the missing values with suitable substitutes. For the following four columns, this is our strategy:\n",
"\n",
"- age: impute with the median age value\n",
"- fare: impute with the mean of the 5 most common fare values\n",
"- cabin: impute with \"M\" for \"Missing\"\n",
"- embarked: impute with the most common value"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.546371Z",
"start_time": "2020-05-01T17:12:38.529144Z"
}
},
"outputs": [],
"source": [
"# Handle missing values\n",
"\n",
"# Age: impute with the (approximate) median age of the training set\n",
"median_age = df_train.percentile_approx(expression='age', percentage=50.0)\n",
"df_train['age'] = df_train.age.fillna(value=median_age)\n",
"\n",
"# Fare: the mean of the 5 most common ticket prices.\n",
"fill_fares = df_train.fare.value_counts(dropna=True)\n",
"fill_fare = fill_fares.iloc[:5].index.values.mean()\n",
"df_train['fare'] = df_train.fare.fillna(value=fill_fare)\n",
"\n",
"# Cabin: this is a string column so let's mark it as \"M\" for \"Missing\"\n",
"df_train['cabin'] = df_train.cabin.fillna(value='M')\n",
"\n",
"# Embarked: Similar as for Cabin, let's mark the missing values with \"U\" for unknown\n",
"fill_embarked = df_train.embarked.value_counts(dropna=True).index[0]\n",
"df_train['embarked'] = df_train.embarked.fillna(value=fill_embarked)"
]
},
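{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Aside:_ the imputation strategy above can also be sketched in plain Python, which may help to see what each fill value is. This is only an illustration on a handful of made-up values using the standard library, not `vaex`; the helper `impute` and the sample lists are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"from statistics import median\n",
"\n",
"\n",
"def impute(values, fill):\n",
"    # Replace missing entries (None) with the given fill value\n",
"    return [fill if v is None else v for v in values]\n",
"\n",
"\n",
"# Made-up sample values, with None marking a missing entry\n",
"ages = [19.0, None, 35.0, 20.0, 21.0]\n",
"fares = [7.75, 7.75, 8.05, 8.05, 8.05, 13.0, 26.0, 10.5, 7.9, None]\n",
"\n",
"known_ages = [a for a in ages if a is not None]\n",
"known_fares = [f for f in fares if f is not None]\n",
"\n",
"# Age: the median of the known values\n",
"fill_age = median(known_ages)\n",
"\n",
"# Fare: the mean of the 5 most common known values\n",
"top_fares = [fare for fare, _ in Counter(known_fares).most_common(5)]\n",
"fill_fare = sum(top_fares) / len(top_fares)\n",
"\n",
"print(impute(ages, fill_age))\n",
"print(impute(fares, fill_fare))"
]
},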
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### String processing\n",
"\n",
"Next up, let's engineer some new, more meaningful features out of the \"raw\" data that is present in the dataset. \n",
"Starting with the name of the passengers, we are going to extract the titles, as well as we are going to count the number of words a name contains. These features can be a loose proxy to the age and status of the passengers."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.587351Z",
"start_time": "2020-05-01T17:12:38.548452Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Expression = name_title\n",
"Length: 1,047 dtype: str (column)\n",
"---------------------------------\n",
" 0 Mr\n",
" 1 Mr\n",
" 2 Mrs\n",
" 3 Miss\n",
" 4 Mr\n",
" ... \n",
"1042 Master\n",
"1043 Mrs\n",
"1044 Master\n",
"1045 Mr\n",
"1046 Mr"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Expression = name_num_words\n",
"Length: 1,047 dtype: int64 (column)\n",
"-----------------------------------\n",
" 0 3\n",
" 1 4\n",
" 2 5\n",
" 3 4\n",
" 4 4\n",
" ... \n",
"1042 4\n",
"1043 6\n",
"1044 4\n",
"1045 4\n",
"1046 3"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Engineer features from the names\n",
"\n",
"# Titles\n",
"df_train['name_title'] = df_train['name'].str.replace('.* ([A-Z][a-z]+)\\..*', \"\\\\1\", regex=True)\n",
"display(df_train['name_title'])\n",
"\n",
"# Number of words in the name\n",
"df_train['name_num_words'] = df_train['name'].str.count(\"[ ]+\", regex=True) + 1\n",
"display(df_train['name_num_words'])"
]
},
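{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Aside:_ to make the two regular expressions above more tangible, here is a small sketch that applies the same patterns with the standard `re` module to two names from the training set. It assumes that the `vaex` string methods behave like `re.sub` and `re.findall` for these patterns; the helper names `extract_title` and `count_words` are made up for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Same pattern as in the cell above: capture the capitalised word that is directly followed by a dot\n",
"TITLE_PATTERN = re.compile(r'.* ([A-Z][a-z]+)\\..*')\n",
"\n",
"\n",
"def extract_title(name):\n",
"    # Replace the whole match with the captured title; names without a match are returned unchanged\n",
"    return TITLE_PATTERN.sub(r'\\1', name)\n",
"\n",
"\n",
"def count_words(name):\n",
"    # One more than the number of runs of spaces\n",
"    return len(re.findall(r'[ ]+', name)) + 1\n",
"\n",
"\n",
"for name in ['Stoytcheff, Mr. Ilia', 'Abbott, Mrs. Stanton (Rosa Hunt)']:\n",
"    print(name, '->', extract_title(name), count_words(name))"
]
},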
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the cabin colum, we will engineer 3 features:\n",
" - \"deck\": extacting the deck on which the cabin is located, which is encoded in each cabin value;\n",
" - \"multi_cabin: a boolean feature indicating whether a passenger is allocated more than one cabin\n",
" - \"has_cabin\": since there were plenty of values in the original cabin column that had missing values, we are just going to build a feature which tells us whether a passenger had an assigned cabin or not."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.747634Z",
"start_time": "2020-05-01T17:12:38.594540Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Expression = deck\n",
"Length: 1,047 dtype: str (column)\n",
"---------------------------------\n",
" 0 M\n",
" 1 B\n",
" 2 M\n",
" 3 M\n",
" 4 M\n",
" ... \n",
"1042 M\n",
"1043 M\n",
"1044 M\n",
"1045 B\n",
"1046 M"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Expression = multi_cabin\n",
"Length: 1,047 dtype: int64 (column)\n",
"-----------------------------------\n",
" 0 0\n",
" 1 0\n",
" 2 0\n",
" 3 0\n",
" 4 0\n",
" ... \n",
"1042 0\n",
"1043 0\n",
"1044 0\n",
"1045 1\n",
"1046 0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Expression = has_cabin\n",
"Length: 1,047 dtype: int64 (column)\n",
"-----------------------------------\n",
" 0 1\n",
" 1 1\n",
" 2 1\n",
" 3 1\n",
" 4 1\n",
" ... \n",
"1042 1\n",
"1043 1\n",
"1044 1\n",
"1045 1\n",
"1046 1"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Extract the deck\n",
"df_train['deck'] = df_train.cabin.str.slice(start=0, stop=1)\n",
"display(df_train['deck'])\n",
"\n",
"# Passengers under which name have several rooms booked, these are all for 1st class passengers\n",
"df_train['multi_cabin'] = ((df_train.cabin.str.count(pat='[A-Z]', regex=True) > 1) &\\\n",
" ~(df_train.deck == 'F')).astype('int')\n",
"display(df_train['multi_cabin'])\n",
"\n",
"# Out of these, cabin has the most missing values, so let's create a feature tracking if a passenger had a cabin\n",
"df_train['has_cabin'] = df_train.cabin.notna().astype('int')\n",
"display(df_train['has_cabin'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### More features\n",
"\n",
"There are two features that give an indication whether a passenger is travelling alone, or with a famly. \n",
"These are the \"sibsp\" and \"parch\" columns that tell us the number of siblinds or spouses and the number of parents or children each passenger has on-board respectively. We are going to use this information to build two columns:\n",
" - \"family_size\" the size of the family of each passenger;\n",
" - \"is_alone\" an additional boolean feature which indicates whether a passenger is traveling without their family. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.813132Z",
"start_time": "2020-05-01T17:12:38.750219Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Expression = family_size\n",
"Length: 1,047 dtype: int64 (column)\n",
"-----------------------------------\n",
" 0 1\n",
" 1 1\n",
" 2 3\n",
" 3 4\n",
" 4 1\n",
" ... \n",
"1042 8\n",
"1043 2\n",
"1044 3\n",
"1045 2\n",
"1046 1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Expression = is_alone\n",
"Length: 1,047 dtype: int64 (column)\n",
"-----------------------------------\n",
" 0 0\n",
" 1 0\n",
" 2 0\n",
" 3 0\n",
" 4 0\n",
" ... \n",
"1042 0\n",
"1043 0\n",
"1044 0\n",
"1045 0\n",
"1046 0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Size of family that are on board: passenger + number of siblings, spouses, parents, children. \n",
"df_train['family_size'] = (df_train.sibsp + df_train.parch + 1)\n",
"display(df_train['family_size'])\n",
"\n",
"# Whether or not a passenger is alone\n",
"df_train['is_alone'] = (df_train.family_size == 0).astype('int')\n",
"display(df_train['is_alone'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's create two new features:\n",
" - age $\\times$ class\n",
" - fare per family member, i.e. fare $/$ family_size"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.831478Z",
"start_time": "2020-05-01T17:12:38.823592Z"
}
},
"outputs": [],
"source": [
"# Create new features\n",
"df_train['age_times_class'] = df_train.age * df_train.pclass\n",
"\n",
"# fare per person in the family\n",
"df_train['fare_per_family_member'] = df_train.fare / df_train.family_size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling (part 1): gradient boosted trees\n",
"\n",
"Since this dataset contains a lot of categorical features, we will start with a tree based model. This we will gear the following feature pre-processing towards the use of tree-based models.\n",
"\n",
"### Feature pre-processing for boosted tree models\n",
"\n",
"The features \"sex\", \"embarked\", and \"deck\" can be simply label encoded. The feature \"name_tite\" contains certain a larger degree of cardinality, relative to the size of the training set, and in this case we will use the Frequency Encoder."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:38.983682Z",
"start_time": "2020-05-01T17:12:38.833258Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:MainThread:numexpr.utils:NumExpr defaulting to 4 threads.\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
#
pclass
survived
name
sex
age
sibsp
parch
ticket
fare
cabin
embarked
boat
body
home_dest
name_title
name_num_words
deck
multi_cabin
has_cabin
family_size
is_alone
age_times_class
fare_per_family_member
label_encoded_sex
label_encoded_embarked
label_encoded_deck
frequency_encoded_name_title
\n",
"\n",
"\n",
"
0
3
False
Stoytcheff, Mr. Ilia
male
19.0
0
0
349205
7.8958
M
S
None
nan
None
Mr
3
M
0
1
1
0
57.0
7.8958
0
0
0
0.5787965616045845
\n",
"
1
1
False
Payne, Mr. Vivian Ponsonby
male
23.0
0
0
12749
93.5
B24
S
None
nan
Montreal, PQ
Mr
4
B
0
1
1
0
23.0
93.5
0
0
1
0.5787965616045845
\n",
"
2
3
True
Abbott, Mrs. Stanton (Rosa Hunt)
female
35.0
1
1
C.A. 2673
20.25
M
S
A
nan
East Providence, RI
Mrs
5
M
0
1
3
0
105.0
6.75
1
0
0
0.1451766953199618
\n",
"
3
2
True
Hocking, Miss. Ellen \"Nellie\"
female
20.0
2
1
29105
23.0
M
S
4
nan
Cornwall / Akron, OH
Miss
4
M
0
1
4
0
40.0
5.75
1
0
0
0.20152817574021012
\n",
"
4
3
False
Nilsson, Mr. August Ferdinand
male
21.0
0
0
350410
7.8542
M
S
None
nan
None
Mr
4
M
0
1
1
0
63.0
7.8542
0
0
0
0.5787965616045845
\n",
"
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
\n",
"
1,042
3
False
Goodwin, Master. Sidney Leonard
male
1.0
5
2
CA 2144
46.9
M
S
None
nan
Wiltshire, England Niagara Falls, NY
Master
4
M
0
1
8
0
3.0
5.8625
0
0
0
0.045845272206303724
\n",
"
1,043
3
False
Ahlin, Mrs. Johan (Johanna Persdotter Larsson)
female
40.0
1
0
7546
9.475
M
S
None
nan
Sweden Akeley, MN
Mrs
6
M
0
1
2
0
120.0
4.7375
1
0
0
0.1451766953199618
\n",
"
1,044
3
True
Johnson, Master. Harold Theodor
male
4.0
1
1
347742
11.1333
M
S
15
nan
None
Master
4
M
0
1
3
0
12.0
3.7111
0
0
0
0.045845272206303724
\n",
"
1,045
1
False
Baxter, Mr. Quigg Edmond
male
24.0
0
1
PC 17558
247.5208
B58 B60
C
None
nan
Montreal, PQ
Mr
4
B
1
1
2
0
24.0
123.7604
0
2
1
0.5787965616045845
\n",
"
1,046
3
False
Coleff, Mr. Satio
male
24.0
0
0
349209
7.4958
M
S
None
nan
None
Mr
3
M
0
1
1
0
72.0
7.4958
0
0
0
0.5787965616045845
\n",
"\n",
"
"
],
"text/plain": [
"# pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest name_title name_num_words deck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title\n",
"0 3 False Stoytcheff, Mr. Ilia male 19.0 0 0 349205 7.8958 M S None nan None Mr 3 M 0 1 1 0 57.0 7.8958 0 0 0 0.5787965616045845\n",
"1 1 False Payne, Mr. Vivian Ponsonby male 23.0 0 0 12749 93.5 B24 S None nan Montreal, PQ Mr 4 B 0 1 1 0 23.0 93.5 0 0 1 0.5787965616045845\n",
"2 3 True Abbott, Mrs. Stanton (Rosa Hunt) female 35.0 1 1 C.A. 2673 20.25 M S A nan East Providence, RI Mrs 5 M 0 1 3 0 105.0 6.75 1 0 0 0.1451766953199618\n",
"3 2 True Hocking, Miss. Ellen \"Nellie\" female 20.0 2 1 29105 23.0 M S 4 nan Cornwall / Akron, OH Miss 4 M 0 1 4 0 40.0 5.75 1 0 0 0.20152817574021012\n",
"4 3 False Nilsson, Mr. August Ferdinand male 21.0 0 0 350410 7.8542 M S None nan None Mr 4 M 0 1 1 0 63.0 7.8542 0 0 0 0.5787965616045845\n",
"... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n",
"1,042 3 False Goodwin, Master. Sidney Leonard male 1.0 5 2 CA 2144 46.9 M S None nan Wiltshire, England Niagara Falls, NY Master 4 M 0 1 8 0 3.0 5.8625 0 0 0 0.045845272206303724\n",
"1,043 3 False Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.0 1 0 7546 9.475 M S None nan Sweden Akeley, MN Mrs 6 M 0 1 2 0 120.0 4.7375 1 0 0 0.1451766953199618\n",
"1,044 3 True Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 M S 15 nan None Master 4 M 0 1 3 0 12.0 3.7111 0 0 0 0.045845272206303724\n",
"1,045 1 False Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208 B58 B60 C None nan Montreal, PQ Mr 4 B 1 1 2 0 24.0 123.7604 0 2 1 0.5787965616045845\n",
"1,046 3 False Coleff, Mr. Satio male 24.0 0 0 349209 7.4958 M S None nan None Mr 3 M 0 1 1 0 72.0 7.4958 0 0 0 0.5787965616045845"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"label_encoder = vaex.ml.LabelEncoder(features=['sex', 'embarked', 'deck'], allow_unseen=True)\n",
"df_train = label_encoder.fit_transform(df_train)\n",
"\n",
"# While doing a transform, previously unseen values will be encoded as \"zero\".\n",
"frequency_encoder = vaex.ml.FrequencyEncoder(features=['name_title'], unseen='zero')\n",
"df_train = frequency_encoder.fit_transform(df_train)\n",
"df_train"
]
},
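{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Aside:_ the idea behind the two encoders can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the `vaex.ml` implementation: the helper names are made up, the integer codes assigned here (order of first appearance) need not match the ones `vaex` assigns, and the fallback of -1 for unseen labels is an assumption of this sketch. The fallback of zero for unseen frequencies mirrors the `unseen='zero'` option used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"\n",
"def fit_label_encoder(values):\n",
"    # Map each distinct category to an integer code, in order of first appearance\n",
"    codes = {}\n",
"    for value in values:\n",
"        codes.setdefault(value, len(codes))\n",
"    return codes\n",
"\n",
"\n",
"def fit_frequency_encoder(values):\n",
"    # Map each category to its relative frequency in the training data\n",
"    counts = Counter(values)\n",
"    return {key: count / len(values) for key, count in counts.items()}\n",
"\n",
"\n",
"train_titles = ['Mr', 'Mr', 'Mrs', 'Miss', 'Mr']\n",
"label_codes = fit_label_encoder(train_titles)\n",
"frequencies = fit_frequency_encoder(train_titles)\n",
"\n",
"# 'Dr' never occurs in the training data, so it falls back to a default value\n",
"test_titles = ['Mr', 'Dr']\n",
"print([label_codes.get(title, -1) for title in test_titles])\n",
"print([frequencies.get(title, 0.0) for title in test_titles])"
]
},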
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once all the categorical data is encoded, we can select the features we are going to use for training the model."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:39.052837Z",
"start_time": "2020-05-01T17:12:38.986328Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
#
label_encoded_sex
label_encoded_embarked
label_encoded_deck
frequency_encoded_name_title
multi_cabin
name_num_words
has_cabin
is_alone
family_size
age_times_class
fare_per_family_member
age
fare
\n",
"\n",
"\n",
"
0
0
0
0
0.578797
0
3
1
0
1
57
7.8958
19
7.8958
\n",
"
1
0
0
1
0.578797
0
4
1
0
1
23
93.5
23
93.5
\n",
"
2
1
0
0
0.145177
0
5
1
0
3
105
6.75
35
20.25
\n",
"
3
1
0
0
0.201528
0
4
1
0
4
40
5.75
20
23
\n",
"
4
0
0
0
0.578797
0
4
1
0
1
63
7.8542
21
7.8542
\n",
"\n",
"
"
],
"text/plain": [
" # label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title multi_cabin name_num_words has_cabin is_alone family_size age_times_class fare_per_family_member age fare\n",
" 0 0 0 0 0.578797 0 3 1 0 1 57 7.8958 19 7.8958\n",
" 1 0 0 1 0.578797 0 4 1 0 1 23 93.5 23 93.5\n",
" 2 1 0 0 0.145177 0 5 1 0 3 105 6.75 35 20.25\n",
" 3 1 0 0 0.201528 0 4 1 0 4 40 5.75 20 23\n",
" 4 0 0 0 0.578797 0 4 1 0 1 63 7.8542 21 7.8542"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# features to use for the trainin of the boosting model\n",
"encoded_features = df_train.get_column_names(regex='^freque|^label')\n",
"features = encoded_features + ['multi_cabin', 'name_num_words', \n",
" 'has_cabin', 'is_alone', \n",
" 'family_size', 'age_times_class',\n",
" 'fare_per_family_member',\n",
" 'age', 'fare']\n",
"\n",
"# Preview the feature matrix\n",
"df_train[features].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimator: [xgboost](https://xgboost.readthedocs.io/en/latest/)\n",
"\n",
"Now let's feed this data into an a tree based estimator. In this example we will use [xgboost](https://xgboost.readthedocs.io/en/latest/). In principle, any algorithm that follows the [scikit-learn](https://scikit-learn.org/stable/) API convention, i.e. it contains the `.fit`, `.predict` methods is compatable with `vaex`. However, the data will be materialized, i.e. will be read into memory before it is passed on to the estimators. We are hard at work trying to make at least some of the estimators from [scikit-learn](https://scikit-learn.org/stable/) run out-of-core!\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:40.968831Z",
"start_time": "2020-05-01T17:12:39.055474Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
#
pclass
survived
name
sex
age
sibsp
parch
ticket
fare
cabin
embarked
boat
body
home_dest
name_title
name_num_words
deck
multi_cabin
has_cabin
family_size
is_alone
age_times_class
fare_per_family_member
label_encoded_sex
label_encoded_embarked
label_encoded_deck
frequency_encoded_name_title
prediction_xgb
\n",
"\n",
"\n",
"
0
3
False
Stoytcheff, Mr. Ilia
male
19.0
0
0
349205
7.8958
M
S
None
nan
None
Mr
3
M
0
1
1
0
57.0
7.8958
0
0
0
0.5787965616045845
False
\n",
"
1
1
False
Payne, Mr. Vivian Ponsonby
male
23.0
0
0
12749
93.5
B24
S
None
nan
Montreal, PQ
Mr
4
B
0
1
1
0
23.0
93.5
0
0
1
0.5787965616045845
False
\n",
"
2
3
True
Abbott, Mrs. Stanton (Rosa Hunt)
female
35.0
1
1
C.A. 2673
20.25
M
S
A
nan
East Providence, RI
Mrs
5
M
0
1
3
0
105.0
6.75
1
0
0
0.1451766953199618
True
\n",
"
3
2
True
Hocking, Miss. Ellen \"Nellie\"
female
20.0
2
1
29105
23.0
M
S
4
nan
Cornwall / Akron, OH
Miss
4
M
0
1
4
0
40.0
5.75
1
0
0
0.20152817574021012
True
\n",
"
4
3
False
Nilsson, Mr. August Ferdinand
male
21.0
0
0
350410
7.8542
M
S
None
nan
None
Mr
4
M
0
1
1
0
63.0
7.8542
0
0
0
0.5787965616045845
False
\n",
"
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
\n",
"
1,042
3
False
Goodwin, Master. Sidney Leonard
male
1.0
5
2
CA 2144
46.9
M
S
None
nan
Wiltshire, England Niagara Falls, NY
Master
4
M
0
1
8
0
3.0
5.8625
0
0
0
0.045845272206303724
False
\n",
"
1,043
3
False
Ahlin, Mrs. Johan (Johanna Persdotter Larsson)
female
40.0
1
0
7546
9.475
M
S
None
nan
Sweden Akeley, MN
Mrs
6
M
0
1
2
0
120.0
4.7375
1
0
0
0.1451766953199618
False
\n",
"
1,044
3
True
Johnson, Master. Harold Theodor
male
4.0
1
1
347742
11.1333
M
S
15
nan
None
Master
4
M
0
1
3
0
12.0
3.7111
0
0
0
0.045845272206303724
True
\n",
"
1,045
1
False
Baxter, Mr. Quigg Edmond
male
24.0
0
1
PC 17558
247.5208
B58 B60
C
None
nan
Montreal, PQ
Mr
4
B
1
1
2
0
24.0
123.7604
0
2
1
0.5787965616045845
False
\n",
"
1,046
3
False
Coleff, Mr. Satio
male
24.0
0
0
349209
7.4958
M
S
None
nan
None
Mr
3
M
0
1
1
0
72.0
7.4958
0
0
0
0.5787965616045845
False
\n",
"\n",
"
"
],
"text/plain": [
"# pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest name_title name_num_words deck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title prediction_xgb\n",
"0 3 False Stoytcheff, Mr. Ilia male 19.0 0 0 349205 7.8958 M S None nan None Mr 3 M 0 1 1 0 57.0 7.8958 0 0 0 0.5787965616045845 False\n",
"1 1 False Payne, Mr. Vivian Ponsonby male 23.0 0 0 12749 93.5 B24 S None nan Montreal, PQ Mr 4 B 0 1 1 0 23.0 93.5 0 0 1 0.5787965616045845 False\n",
"2 3 True Abbott, Mrs. Stanton (Rosa Hunt) female 35.0 1 1 C.A. 2673 20.25 M S A nan East Providence, RI Mrs 5 M 0 1 3 0 105.0 6.75 1 0 0 0.1451766953199618 True\n",
"3 2 True Hocking, Miss. Ellen \"Nellie\" female 20.0 2 1 29105 23.0 M S 4 nan Cornwall / Akron, OH Miss 4 M 0 1 4 0 40.0 5.75 1 0 0 0.20152817574021012 True\n",
"4 3 False Nilsson, Mr. August Ferdinand male 21.0 0 0 350410 7.8542 M S None nan None Mr 4 M 0 1 1 0 63.0 7.8542 0 0 0 0.5787965616045845 False\n",
"... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n",
"1,042 3 False Goodwin, Master. Sidney Leonard male 1.0 5 2 CA 2144 46.9 M S None nan Wiltshire, England Niagara Falls, NY Master 4 M 0 1 8 0 3.0 5.8625 0 0 0 0.045845272206303724 False\n",
"1,043 3 False Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.0 1 0 7546 9.475 M S None nan Sweden Akeley, MN Mrs 6 M 0 1 2 0 120.0 4.7375 1 0 0 0.1451766953199618 False\n",
"1,044 3 True Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 M S 15 nan None Master 4 M 0 1 3 0 12.0 3.7111 0 0 0 0.045845272206303724 True\n",
"1,045 1 False Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208 B58 B60 C None nan Montreal, PQ Mr 4 B 1 1 2 0 24.0 123.7604 0 2 1 0.5787965616045845 False\n",
"1,046 3 False Coleff, Mr. Satio male 24.0 0 0 349209 7.4958 M S None nan None Mr 3 M 0 1 1 0 72.0 7.4958 0 0 0 0.5787965616045845 False"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import xgboost\n",
"import vaex.ml.sklearn\n",
"\n",
"# Instantiate the xgboost model normally, using the scikit-learn API\n",
"xgb_model = xgboost.sklearn.XGBClassifier(max_depth=11,\n",
" learning_rate=0.1, \n",
" n_estimators=500, \n",
" subsample=0.75, \n",
" colsample_bylevel=1, \n",
" colsample_bytree=1,\n",
" scale_pos_weight=1.5,\n",
" reg_lambda=1.5, \n",
" reg_alpha=5, \n",
" n_jobs=-1,\n",
" random_state=42,\n",
" verbosity=0)\n",
"\n",
"# Make it work with vaex (for the automagic pipeline and lazy predictions)\n",
"vaex_xgb_model = vaex.ml.sklearn.Predictor(features=features,\n",
" target='survived',\n",
" model=xgb_model, \n",
" prediction_name='prediction_xgb')\n",
"# Train the model\n",
"vaex_xgb_model.fit(df_train)\n",
"# Get the prediction of the model on the training data\n",
"df_train = vaex_xgb_model.transform(df_train)\n",
"\n",
"# Preview the resulting train dataframe that contans the predictions\n",
"df_train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that in the above cell block, we call `.transform` on the `vaex_xgb_model` object. This adds the \"prediction_xgb\" column as _virtual column_ in the output dataframe. This can be quite convenient when calculating various metrics and making diagnosic plots. Of course, one can call a `.predict` on the `vaex_xgb_model` object, which returns an in-memory `numpy` array object housing the predictions.\n",
"\n",
"### Performance on training set\n",
"\n",
"Anyway, let's see what the performance is of the model on the training set. First let's create a convenience function that will help us get multiple metrics at once."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:40.985268Z",
"start_time": "2020-05-01T17:12:40.975947Z"
}
},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, f1_score, roc_auc_score\n",
"def binary_metrics(y_true, y_pred):\n",
" acc = accuracy_score(y_true=y_true, y_pred=y_pred)\n",
" f1 = f1_score(y_true=y_true, y_pred=y_pred)\n",
" roc = roc_auc_score(y_true=y_true, y_score=y_pred)\n",
" print(f'Accuracy: {acc:.3f}')\n",
" print(f'f1 score: {f1:.3f}')\n",
" print(f'roc-auc: {roc:.3f}')"
]
},
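{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Aside:_ since the convenience function above is fed hard class labels rather than probabilities, the roc-auc value it reports is simply the mean of the recalls of the two classes. The plain-Python sketch below checks this on a few made-up labels, using the pairwise definition of the area under the ROC curve. The helper names and the sample labels are hypothetical and serve only as an illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pairwise_auc(y_true, y_score):\n",
"    # Probability that a random positive is scored above a random negative, counting ties as 1/2\n",
"    pos = [s for t, s in zip(y_true, y_score) if t]\n",
"    neg = [s for t, s in zip(y_true, y_score) if not t]\n",
"    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)\n",
"    return wins / (len(pos) * len(neg))\n",
"\n",
"\n",
"def balanced_accuracy(y_true, y_pred):\n",
"    # Mean of the recall on the positive class and the recall on the negative class\n",
"    tpr = sum(1 for t, p in zip(y_true, y_pred) if t and p) / sum(1 for t in y_true if t)\n",
"    tnr = sum(1 for t, p in zip(y_true, y_pred) if not t and not p) / sum(1 for t in y_true if not t)\n",
"    return (tpr + tnr) / 2\n",
"\n",
"\n",
"# Made-up labels and hard predictions\n",
"y_true = [1, 1, 1, 0, 0, 0, 0, 1]\n",
"y_pred = [1, 0, 1, 0, 1, 0, 0, 1]\n",
"print(pairwise_auc(y_true, y_pred), balanced_accuracy(y_true, y_pred))"
]
},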
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's check the performance of the model on the training set."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:41.088203Z",
"start_time": "2020-05-01T17:12:40.988951Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Metrics for the training set:\n",
"Accuracy: 0.924\n",
"f1 score: 0.896\n",
"roc-auc: 0.914\n"
]
}
],
"source": [
"print('Metrics for the training set:')\n",
"binary_metrics(y_true=df_train.survived.values, y_pred=df_train.prediction_xgb.values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Automatic pipelines\n",
"\n",
"Now, let's inspect the performance of the model on the test set. You probably noticed that, unlike when using other libraries, we did not bother to create a pipeline while doing all the cleaning, inputing, feature engineering and categorial encoding. Well, we did not _explicitly_ create a pipeline. In fact `veax` keeps track of all the changes one applies to a DataFrame in something called a state. A state is the place which contains all the informations regarding, for instance, the virtual columns we've created, which includes the newly engineered features, the categorically encoded columns, and even the model prediction! So all we need to do, is to extract the state from the training DataFrame, and apply it to the test DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:41.299459Z",
"start_time": "2020-05-01T17:12:41.093866Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
#
pclass
survived
name
sex
age
sibsp
parch
ticket
fare
cabin
embarked
boat
body
home_dest
name_title
name_num_words
deck
multi_cabin
has_cabin
family_size
is_alone
age_times_class
fare_per_family_member
label_encoded_sex
label_encoded_embarked
label_encoded_deck
frequency_encoded_name_title
prediction_xgb
\n",
"\n",
"\n",
"
0
3
False
O'Connor, Mr. Patrick
male
28.032
0
0
366713
7.75
M
Q
None
nan
None
Mr
3
M
0
1
1
0
84.096
7.75
0
1
0
0.578797
False
\n",
"
1
3
False
Canavan, Mr. Patrick
male
21
0
0
364858
7.75
M
Q
None
nan
Ireland Philadelphia, PA
Mr
3
M
0
1
1
0
63
7.75
0
1
0
0.578797
False
\n",
"
2
1
False
Ovies y Rodriguez, Mr. Servando
male
28.5
0
0
PC 17562
27.7208
D43
C
None
189
?Havana, Cuba
Mr
5
D
0
1
1
0
28.5
27.7208
0
2
4
0.578797
True
\n",
"
3
3
False
Windelov, Mr. Einar
male
21
0
0
SOTON/OQ 3101317
7.25
M
S
None
nan
None
Mr
3
M
0
1
1
0
63
7.25
0
0
0
0.578797
False
\n",
"
4
2
True
Shelley, Mrs. William (Imanita Parrish Hall)
female
25
0
1
230433
26
M
S
12
nan
Deer Lodge, MT
Mrs
6
M
0
1
2
0
50
13
1
0
0
0.145177
True
\n",
"\n",
"
"
],
"text/plain": [
" # pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest name_title name_num_words deck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title prediction_xgb\n",
" 0 3 False O'Connor, Mr. Patrick male 28.032 0 0 366713 7.75 M Q None nan None Mr 3 M 0 1 1 0 84.096 7.75 0 1 0 0.578797 False\n",
" 1 3 False Canavan, Mr. Patrick male 21 0 0 364858 7.75 M Q None nan Ireland Philadelphia, PA Mr 3 M 0 1 1 0 63 7.75 0 1 0 0.578797 False\n",
" 2 1 False Ovies y Rodriguez, Mr. Servando male 28.5 0 0 PC 17562 27.7208 D43 C None 189 ?Havana, Cuba Mr 5 D 0 1 1 0 28.5 27.7208 0 2 4 0.578797 True\n",
" 3 3 False Windelov, Mr. Einar male 21 0 0 SOTON/OQ 3101317 7.25 M S None nan None Mr 3 M 0 1 1 0 63 7.25 0 0 0 0.578797 False\n",
" 4 2 True Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 1 230433 26 M S 12 nan Deer Lodge, MT Mrs 6 M 0 1 2 0 50 13 1 0 0 0.145177 True"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# state transfer to the test set\n",
"state = df_train.state_get()\n",
"df_test.state_set(state)\n",
"\n",
"# Preview of the \"transformed\" test set\n",
"df_test.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that once we apply the state from the train to the test set, the test DataFrame contains all the features we created or modified in the training data, and even the predictions of the xgboost model!\n",
"\n",
"The state is a simple Python dictionary, which can be easily stored as JSON to disk, which makes it very easy to deploy.\n",
"\n",
"### Performance on test set\n",
"\n",
"Now it is trivial to check the model performance on the test set:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:41.381884Z",
"start_time": "2020-05-01T17:12:41.310025Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Metrics for the test set:\n",
"Accuracy: 0.798\n",
"f1 score: 0.744\n",
"roc-auc: 0.785\n"
]
}
],
"source": [
"print('Metrics for the test set:')\n",
"binary_metrics(y_true=df_test.survived.values, y_pred=df_test.prediction_xgb.values)"
]
},
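{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Aside:_ to illustrate why a dictionary-based state makes deployment easy, here is a toy sketch in plain Python. It stores two of the engineered features as expression strings, round-trips them through JSON, and re-applies them to an unseen row. The layout of `toy_state` and the helper `apply_state` are made up for this illustration and do not reflect the actual schema or internals of a `vaex` state."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# A toy state: virtual columns stored as expression strings (hypothetical layout)\n",
"toy_state = {'virtual_columns': {'family_size': 'sibsp + parch + 1',\n",
"                                 'fare_per_family_member': 'fare / family_size'}}\n",
"\n",
"\n",
"def apply_state(rows, state):\n",
"    # Evaluate each stored expression against every row, in the order the columns were defined\n",
"    transformed = []\n",
"    for row in rows:\n",
"        row = dict(row)\n",
"        for name, expression in state['virtual_columns'].items():\n",
"            # eval is acceptable for this toy example only; never use it on untrusted input\n",
"            row[name] = eval(expression, {'__builtins__': {}}, row)\n",
"        transformed.append(row)\n",
"    return transformed\n",
"\n",
"\n",
"# Round-trip through JSON, as one would when storing the state on disk\n",
"restored_state = json.loads(json.dumps(toy_state))\n",
"\n",
"test_rows = [{'sibsp': 1, 'parch': 1, 'fare': 20.25}]\n",
"print(apply_state(test_rows, restored_state))"
]
},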
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature importance\n",
"Let's now look at the feature importance of the `xgboost` model."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:41.911379Z",
"start_time": "2020-05-01T17:12:41.384369Z"
},
"scrolled": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAh8AAAIbCAYAAABLzPzHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nOzdeZgdZZ3+//dN2AkGMCgmhsSRRQfxC9iAOKPAoCJGBBdQENmUzXEAf6yDjiIKwggCGkcWQQWU1UFQYFgGgyiLdGQzosMWCGEnZCMsWe7fH/W0Ho7dfU53uqvT5H5dV1+eU/Usn6qTmbr7qTqNbBMRERFRl+WGuoCIiIhYtiR8RERERK0SPiIiIqJWCR8RERFRq4SPiIiIqFXCR0RERNQq4SMi4jVC0rqS5kkaMdS1RPQm4SMiYikj6dOSbpf0gqSny+svSFJv/Ww/anuk7UV11RrRHwkfERFLEUmHAacD3wbWAd4IHAj8E7DiEJYWMWCUv3AaEbF0kDQKeBzY0/bPe2gzEfgm8FZgNnCO7WPLvgnAw8AKthdKmgzcDPwL8E7gVmB3288O6oFEtJCVj4iIpcdWwErAFb20eQHYE1gDmAgcJGnnXtrvDuwDvIFq5eTwgSk1ov8SPiIilh6jgWdtL+zaIOkWSbMkvSjpfbYn277X9mLb9wAXAlv3MuaPbP+f7ReBS4BNBvcQIlpL+IiIWHo8B4yWtHzXBtvvsb1G2becpC0l/VrSM5JmUz0PMrqXMZ9seD0fGDkYhUf0RcJHRMTS41bgZWCnXtr8DLgSGGd7FHAG0Ou3YCKWNgkfERFLCduzgK8D/yXpk5JGSlpO0ibAaqXZ6sBM2y9J2oLqmY6IYWX51k0iIqIutv9T0gzgSOA8qgdMHwKOAm4BvgCcImkScBPVcxxrDFG5Ef2Sr9pGRERErXLbJSIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJW+aptRBtGjx7tCRMmDHUZERHDypQpU561vXbz9oSPiDZMmDCBzs7OoS4jImJYkfRId9tz2yUiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUauEj4iIiKhVwkdERETUKuEjIiIiapXwEREREbVK+IiIiIhaJXxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK2WH+oCXkskTQM+b/uGFu0MrG/7gX7M0e++dZL0Y+Ax21+ps+9guXfGbCYcfdVQlxERUatpJ04clHGz8hERERG1SviIiIiIWiV8DAJJW0i6VdIsSU9ImiRpxaZmH5b0kKRnJX1b0nIN/feVdJ+k5yVdK2l8H+dfSdLJkh6V9JSkMyStUvZtI+kxSYdJerrUt09D31UknSLpEUmzJf22oe9HJU0txzVZ0tsb+m0q6Q+S5kq6GFi5qaaPSLqr9L1F0jvb7dvDMY6W9Ksy3kxJN3edQ0ljJP1c0jOSHpZ0cNm+Vjn2Hcv7kZIekLRnD3PsL6lTUuei+bPb/wAiIqJXCR+DYxHwJWA0sBWwHfCFpjYfAzqAzYCdgH0BJO0MHAN8HFgbuBm4sI/znwRsAGwCrAeMBb7asH8dYFTZ/jng+5LWLPtOBt4FvAdYCzgSWCxpg1LHoaWuq4FfSlqxBKtfAOeXPpcCn+iaTNJmwLnAAcDrgTOBK0tI6rVvLw4DHiu1vJHqnLkEkF8Cd5fj2w44VNL2tmdSneezJb0BOBW4y/Z53U1g+yzbHbY7Rqw6qo2SIiKiHQkfg8D2FNu32V5oexrVxXbrpmYn2Z5p+1HgNGC3sv0A4Fu277O9EDgB2KTd1Q9JAvYDvlTGn1vG+HRDswXAcbYX2L4amAdsWC7c+wKH2J5he5HtW2y/DHwKuMr29bYXUIWUVahCyruBFYDTypiXAXc0zLcfcK
bt28uYPwFeLv1a9e3JAuBNwPjS72bbBjYH1rZ9nO1XbD8EnN11/Lavowo4/wtMLOc7IiJqlPAxCCRtUG4JPClpDtXFf3RTs+kNrx8BxpTX44HTy+2EWcBMQFS/xbdjbWBVYErDGP9Ttnd5rgSbLvOBkaXGlYEHuxl3TKkTANuLyzGMLftmlIt/4zF1GQ8c1lVPqWlc6deqb0++DTwAXFduXx3dMNeYprmOoVod6XIW8A7gR7afa2OuiIgYQPmq7eD4AXAnsJvtuZIOBT7Z1GYcMLW8Xhd4vLyeDhxv+6f9nPtZ4EVgI9sz+tH3JeCtVLctGj0ObNz1pqywjANmAAbGSlJDiFiXv4WYrmM6vnlCSVu36NutsqJzGFWo2Qj4taQ7ylwP216/u36SRlCtRJ0HHCTpR+18bXnjsaPoHKSvnEVELGuy8jE4VgfmAPMkvQ04qJs2R0haU9I44BDg4rL9DODfywUVSaMk7dLuxGVF4mzg1PJcA5LGStq+zb7nAt8pD22OkLSVpJWAS4CJkraTtALVhf9l4BbgVmAhcLCk5SV9HNiiYeizgQMlbanKapImSlq9jb7dKg+wrldC0Byq52wWAb8H5kg6qjw8O0LSOyRtXroeU/53X6pbR+eVQBIRETVJ+BgchwO7A3OpLrwXd9PmCmAKcBdwFXAOgO3LqR4YvajcsvkjsEMf5z+K6pbEbWWMG4AN+1D7vVTPXcwstSxn+y/AHsD3qFZIdgR2LM9VvEL1gOzewPNUz4f8d9eAtjupnvuYVPY/UNrSqm8v1i/HNY8qwPyX7cm2F5XaNgEeLrX+EBgl6V3A/wfsWdqdRLVqc3Q340dExCDRq2+1R0R3Ojo63NnZOdRlREQMK5Km2O5o3p6Vj4iIiKhVwscwVf7Y17xufj4z1LUNFEnH9HCM1wx1bRER0X/5tsswZXujoa5hsNk+gepryhER8RqSlY+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJWCR8RERFRq4SPiIiIqFXCR0RERNQq4SMiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUauEj4iIiKhVwkdERETUKuEjIiIiapXwEREREbVK+IiIiIhaJXxERERErZZv1UDShsBFwHrAl21/d9Crir8j6cfAY7a/UmffpZmk9wI/tL1hD/snAA8DK9heuCRz3TtjNhOOvmpJhoiI6NW0EycOdQm1aWfl40hgsu3VEzxiKEmypPW63tu+uTF4SJom6f1DU11ERLSrnfAxHpja3Q5JIwa2nIiIiHit6zV8SLoR2BaYJGmepJ9J+oGkqyW9AGwraYykn0t6RtLDkg5u6L+KpB9Lel7SnyQdIemxhv2v+k22tP1mw/uPSLpL0ixJt0h6Z8O+aZIOl3SPpNmSLpa0csP+nUrfOZIelPQhSbtImtJ0jIdJ+kWL87CSpJMlPSrpKUlnSFql7NtG0mNlnKclPSFpn6ZzcIqkR0qdv23o+1FJU8vxTZb09oZ+m0r6g6S5ki4GVm6qqbdz02vfHo6x1XFMlHRnOZ/TJR3bsG9C+Sz3Kfuel3SgpM3L5zNL0qSm+faVdF9pe62k8S3q+015eXf5t/iprprL/vOBdYFflv1HdjPGKEnnlGObIembCdAREfXrNXzY/hfgZuCLtkcCrwC7A8cDqwO3AL8E7gbGAtsBh0ravgzxNeCt5Wd7YK92C5O0GXAucADweuBM4EpJKzU02xX4EPAW4J3A3qXvFsB5wBHAGsD7gGnAlcBbGi/ywB7A+S3KOQnYANiE6tmXscBXG/avA4wq2z8HfF/SmmXfycC7gPcAa1HdxlosaQPgQuBQYG3gaqoL54qSVgR+UepaC7gU+EQ756ZV3xZ6O44XgD2pzudE4CBJOzf13xJYH/gUcBrwZeD9wEbArpK2LvXvDBwDfLwc+83lXPTI9vvKy/9ne6Tti5v2fxZ4FN
ix7P/Pbob5CbCQ6jPcFPgg8Pme5pS0v6ROSZ2L5s/urbyIiOiD/nzb5Qrbv7O9GNgYWNv2cbZfsf0QcDbw6dJ2V+B42zNtTwf68szIfsCZtm+3vcj2T4CXgXc3tPmu7cdtz6QKQZuU7Z8DzrV9ve3FtmfY/rPtl4GLqQIHkjYCJgC/6qkISSq1fKkcx1zghIZjBFgAHGd7ge2rgXnAhpKWA/YFDik1LLJ9S6njU8BVpcYFVCFlFaqQ8m5gBeC0MuZlwB1tnptWfXvT7XEA2J5s+95yPu+hCgtbN/X/hu2XbF9HFVYutP207RlUAWPT0u4A4Fu27ysPgp4AbNJq9WNJSHojsANwqO0XbD8NnMqrP8dXsX2W7Q7bHSNWHTVYpUVELHNaftulG9MbXo8Hxkia1bBtBNWFBmBMU/tH+jDPeGAvSf/WsG3FMmaXJxtez2/YN45qJaE7PwEulPQV4LPAJSUM9GRtYFVgSpVDABDVcXZ5runbFPOBkcBoqlseD3Yz7hgazoftxZKmU606LAJm2HZD+8Zz19u5cYu+venpOJC0JXAi8I4y10pUqyqNnmp4/WI370c21H+6pFMa9ovq2Pvyb6QvxlOFsicaPsflePW/z4iIqEF/wkfjRW068LDt9Xto+wRVEOh6YHXdpv3zqS7sXdYBup4JmU61anJ8P2qcTnWr5+/Yvk3SK8B7qW4h7d5irGepLpwbld/g++JZ4KVSy91N+x6nWjkC/rrCMg6YQXWOx0pSQ4hYl7+FmB7PTbm10Vvf/voZMAnYwfZLkk6jClf90VX/T5ewpmbuZd90qtWh0Uv6tduIiFgy/QkfjX4PzJF0FNUtlVeAtwOr2L4DuAT4d0m3A6sB/9bU/y5gd0lTgQ9QLeN3ln1nA5dLuqHMsyqwDfCbcuujN+cA10n6FfBr4E3A6rb/XPafR3UhXWj7t70NVFYkzgZOlfRF209LGgu8w/a1bfQ9F/iOpM9SrQRsAfyB6twcLWk74DfAIVQXx1tK94XAwZK+D3y09Pt1q3MD3Nqib3+tDswswWMLqtB2XT/HOgP4hqS7bE+VNAr4oO3mlZRmTwH/ADzQYv/fsf2EpOuAUyT9B9UtpbcAb7Z9U6uCNx47is5l6Dv4ERGDaYn+wqntRcCOVM9aPEz1m/4PqR5aBPg61TL6w1QXquYHOw8p/WcBn6F6ULJr7E6qZxsmAc9TXXD2brOu3wP7UN3Tnw3cRLXs3uV8qtsHrR407XJUmf82SXOAGyjPQrThcOBequcuZlI9vLqc7b9QPXvyParztiPVw5Kv2H6F6mHMvamO/VPAfzccX4/nplXfJfAF4DhJc6ketr2kvwPZvpzqPFxUzucfqZ7HaOVY4Cfl2zO7drP/W8BXyv7Du9m/J9Utoz9RnZvLqIJpRETUSK9+NGCQJ5O2AS6w/ebaJu2+jlWAp4HNbN8/lLXE8NDR0eHOzs7WDSMi4q8kTbHd0bx9Wf1vuxwE3JHgERERUb9lLnxImkZ1u+ewpu1Tyx+nav75zJAUOggkHdPDMV4z1LVB9d9q6aG+eUNdW0REDJxab7tEDFe57RIR0Xe57RIRERFLhYSPiIiIqFXCR0RERNQq4SMiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUauEj4iIiKhVwkdERETUKuEjIiIiapXwEREREbVK+IiIiIhaJXxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8BJI2lHSnpLmSDh7AcT8j6bqG95a03kCN3zDuupLmSRox0GNHRMTAk+2hriGGmKRzgDm2vzTI8xhY3/YDgznPYFjpTev7TXudNtRlDFvTTpw41CVExBCQNMV2R/P2rHwEwHhg6lAXERERy4aEj2WcpBuBbYFJ5dbFIeUWzBxJ0yUd29B2Qrl1sk/Z97ykAyVtLukeSbMkTWpov7ek33Yz5+
aSnpK0fMO2T0i6q0WtW0jqLLU9Jek7TXUtL2mrchxdPy9JmlbaLSfpaEkPSnpO0iWS1lrScxgREX2T8LGMs/0vwM3AF22PBO4G9gTWACYCB0nauanblsD6wKeA04AvA+8HNgJ2lbR1iznvAJ4DPtCweQ/g/Bblng6cbvt1wFuBS7oZ+1bbI8uxrAncBlxYdh8M7AxsDYwBnge+39NkkvYvYadz0fzZLUqLiIh2JXzEq9iebPte24tt30N14W4OE9+w/ZLt64AXgAttP217BlWQ2bSNqX5CFTgoqw/bAz9r0WcBsJ6k0bbn2b6tRfvvlvq+XN4fAHzZ9mO2XwaOBT7ZuALTyPZZtjtsd4xYdVQbhxQREe1I+IhXkbSlpF9LekbSbOBAYHRTs6caXr/YzfuRbUx1AbCjpJHArsDNtp9o0edzwAbAnyXdIekjvRzHAcA2wO62F5fN44HLy+2hWcB9wCLgjW3UGxERAyThI5r9DLgSGGd7FHAGoIGepKyS3Ap8DPgsrW+5YPt+27sBbwBOAi6TtFpzO0nvBb4B7GS78X7JdGAH22s0/KxcaomIiJp0u9wcy7TVgZm2X5K0BbA7cF2LPv11HnA0ZUWiVWNJewDX2n6mrFxAtXLR2GYccDGwp+3/axriDOB4SXvZfkTS2sB7bF/Rau6Nx46iM18XjYgYEFn5iGZfAI6TNBf4Kt081DmALqcED9svtNH+Q8BUSfOoHj79tO2XmtpsB6xDtSrS9Y2Xrq8Rn061qnNdOb7bqB6ejYiIGuWPjMWQkvQgcIDtG4a6lt50dHS4s7NzqMuIiBhW8kfGYqkj6ROAgRuHupaIiKhPwkcMCUmTgR8A/9rwbRQkXdP0R8K6fo4ZsmIjImJA5YHTGBK2t+lh+w41lxIRETXLykdERETUKuEjIiIiapXwEREREbVK+IiIiIhaJXxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJWCR8RERFRq4SPiIiIqFXCR0RERNQq4SMiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUavlh7qA4UTSNODztm9o0c7A+rYf6Mcc/e5bJ0k/Bh6z/ZW6+g7EuZE0GbjA9g/70u/eGbOZcPRV/Z32NWHaiROHuoSIeI3IykdERETUKuEjIiIiapXw0Q+StpB0q6RZkp6QNEnSik3NPizpIUnPSvq2pOUa+u8r6T5Jz0u6VtL4Ps6/kqSTJT0q6SlJZ0hapezbRtJjkg6T9HSpb5+GvqtIOkXSI5JmS/ptQ9+PSppajmuypLc39NtU0h8kzZV0MbByU00fkXRX6XuLpHe227eX4zyi1P+4pH3bPQdl/06lnjmSHpT0oW7Gf5OkeyQd3k49ERExMBI++mcR8CVgNLAVsB3whaY2HwM6gM2AnYB9ASTtDBwDfBxYG7gZuLCP858EbABsAqwHjAW+2rB/HWBU2f454PuS1iz7TgbeBbwHWAs4ElgsaYNSx6GlrquBX0pasQSrXwDnlz6XAp/omkzSZsC5wAHA64EzgStLQOi1b09KWDgc+ACwPvD+ds+BpC2A84AjgDWA9wHTmsafANwETLJ9cg817C+pU1LnovmzW5UcERFtSvjoB9tTbN9me6HtaVQX262bmp1ke6btR4HTgN3K9gOAb9m+z/ZC4ARgk3ZXPyQJ2A/4Uhl/bhnj0w3NFgDH2V5g+2pgHrBhWX3ZFzjE9gzbi2zfYvtl4FPAVbavt72AKqSsQhVS3g2sAJxWxrwMuKNhvv2AM23fXsb8CfBy6deqb092BX5k+4+2XwCO7cM5+BxwbjmWxeVY/9ww9j8Ck4Gv2T6rpwJsn2W7w3bHiFVHtVFyRES0I9926YeySvAdqpWNVanO45SmZtMbXj8CjCmvxwOnSzqlcUiq39wfaWP6tcucU6pr8F/7j2ho81wJNl
3mAyOpVmpWBh7sZtwxjfPbXixpeqlrETDDtpuOqct4YC9J/9awbcUyplv07ckYXn1OG/u0OgfjqFZuevIZ4AHgsjbqiIiIAZbw0T8/AO4EdrM9V9KhwCeb2owDppbX6wKPl9fTgeNt/7Sfcz8LvAhsZHtGP/q+BLwVuLtp3+PAxl1vyurCOGAGVYAYK0kNIWJd/hZiuo7p+OYJJW3dom9Pnijzd1m36Th6OwfTyzH25FjgQ8DPJH3a9qIWtbDx2FF05qumEREDIrdd+md1YA4wT9LbgIO6aXOEpDUljQMOAS4u288A/l3SRgCSRknapd2JbS8GzgZOlfSGMsZYSdu32fdc4DuSxkgaIWkrSSsBlwATJW0naQXgMKpbJ7cAtwILgYMlLS/p48AWDUOfDRwoaUtVVpM0UdLqbfTtySXA3pL+UdKqwNf6cA7OAfYpx7Jc2fe2hrEXALsAqwHnq+Fh4IiIGHz5f7r9cziwOzCX6iJ4cTdtrqC6bXAXcBXVBRHbl1M9LHmRpDnAH4Ed+jj/UVS3DW4rY9wAbNiH2u+leu5iZqllOdt/AfYAvke1srAjsKPtV2y/QvWA7N7A81TPh/x314C2O6mewZhU9j9Q2tKqb09sX0P1rMyNZbwb2z0Htn8P7AOcCsymerD0Vc/UNNT1BuDcBJCIiPro1bfiI6I7HR0d7uzsHOoyIiKGFUlTbHc0b89vexEREVGrhI+lVPljX/O6+fnMUNc2UCQd08MxXjPUtUVExODJt12WUrY3GuoaBpvtE6j+PkdERCxDsvIRERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUauEj4iIiKhVwkdERETUKuEjIiIiapXwEREREbVK+IiIiIhaJXxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJWCR8RERFRq4SPiIiIqFXCRwAg6QxJ/zHUdfRG0jaSHhvqOiIiYsksP9QFRP0k7Q183vY/d22zfeDQVbT0u3fGbCYcfdVQl9G2aSdOHOoSIiJ6lJWPiIiIqFXCx1JC0tGSHpQ0V9KfJH2sbB8h6RRJz0p6WNIXJVnS8mX/KEnnSHpC0gxJ35Q0opd53g6cAWwlaZ6kWWX7jyV9s7zeRtJjko6U9HQZe2dJH5b0f5JmSjqmYczlGup/TtIlktYq+1aWdEHZPkvSHZLe2OJcrCXpR5Iel/S8pF/05ZyVfetJuknS7HLuLi7bJenUclyzJd0j6R3tfUoRETEQcttl6fEg8F7gSWAX4AJJ6wE7ATsAmwAvAJc29fsJ8BSwHrAa8CtgOnBmd5PYvk/SgTTddunGOsDKwFhgb+Bs4HrgXcC6wBRJF9l+CDgY2BnYGngG+C7wfWA3YC9gFDAOeLkcx4stzsX5wDxgo/K/7+mhXbfnzPYTwDeA64BtgRWBjtLng8D7gA2A2cDbgFndDS5pf2B/gBGvW7tFyRER0a6sfCwlbF9q+3Hbi21fDNwPbAHsCpxu+zHbzwMndvUpKwg7AIfafsH208CpwKcHoKQFwPG2FwAXAaNLHXNtTwWmAu8sbQ8AvlxqfBk4FvhkWZ1ZALweWM/2IttTbM/paVJJbyrHdKDt520vsH1Td217OWdd9Y8Hxth+yfZvG7avThU6ZPu+Ela6G/8s2x22O0asOqqNUxYREe1I+FhKSNpT0l3l1sQs4B1UF/wxVCsZXRpfjwdWAJ5o6Hcm8IYBKOk524vK666Viqca9r8IjGyo4/KGGu4DFgFvpFrFuBa4qNxG+U9JK/Qy7zhgZglaverlnAEcCQj4vaSpkvYFsH0jMIlqZeYpSWdJel2ruSIiYuAkfCwFJI2nuq3xReD1ttcA/kh18XwCeHND83ENr6dT3coYbXuN8vM62xu1mNIDV/1f69ihoYY1bK9se0ZZufi67X+kun3yEWDPFmOtJWmN3iZscc6w/a
Tt/WyPoVqZ+a9yGwvb37X9LqrbOhsARyzJwUdERN/kmY+lw2pUgeAZAEn7UP0WD3AJcIikq6ie+Tiqq5PtJyRdB5xS/kbHPOAtwJt7ulVRPAW8WdKKtl8ZgPrPAI6XtJftRyStDbzH9hWStgWeBf4EzKG67bGop4HKMV1DFRb+tRzTVrZ/09S0t3OGpF2AW20/Bjxf2i6StDlV6P4D1fl8qbd6umw8dhSd+fpqRMSAyMrHUsD2n4BTgFupgsHGwO/K7rOpHpy8B7gTuBpYyN8umHtSPVD5J6qL7GXAm1pMeSPVMxtPSnp2AA7hdOBK4DpJc4HbgC3LvnVKTXOobsfcBFzQYrzPUoWUPwNPA4c2N2hxzgA2B26XNK/Udojth4HXUZ3T54FHgOeAk/t2uBERsSRkD/QKfAwmSTsAZ9geP9S1LEs6Ojrc2dk51GVERAwrkqbY7mjenpWPpZykVcrf11he0ljga8DlQ11XREREfyV8LP0EfJ3qNsGdVLcuvtqyU/XfapnXzc8Zg1xvW3qobZ6k9w51bRERMbjywOlSzvZ8qucX+trvQGCp/e+12B7ZulVERLwWZeUjIiIiapXwEREREbVK+IiIiIhaJXxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJWCR8RERFRq4SPiIiIqFXCR0RERNQq4SMiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUathGz4kbSjpTklzJR081PUMJEn/JOl+SfMk7TzAYx8j6Yfl9QRJlrT8QM5RF0mTJX1+qOuIiIi+GZYXneJIYLLtTYe6kEFwHDDJ9ukDPbDtEwZ6zGXBvTNmM+Hoq4a6jLZNO3HiUJcQEdGjYbvyAYwHpva109LwW34bNfTr2KJ/VBnO/7cQETGsDMv/hyvpRmBbYFK5NXFIuQUzR9J0Scc2tO26tfA5SY8CN5bt+0q6T9Lzkq6VNL6NeS3pYEkPSXpW0rcbL1q9jVn6/quk+4H7e5njQeAfgF+WY1tJ0j5l3Lll7gMa2m8j6TFJR0p6WtITknaW9GFJ/ydppqRjGtofK31wLe4AACAASURBVOmCbubdRdKUpm2HSfpFi3PyY0n/JemaUu/vJK0j6bRyHv4sadOG9mMk/VzSM5IebrxlVmq7VNIF5VjvlbSBpH8vxzZd0gebSnirpN9Lmi3pCklrNYz3bkm3SJol6W5J2zTsmyzpeEm/A+aXcx4RETUYluHD9r8ANwNftD0SuBvYE1gDmAgc1M2zElsDbwe2L/uOAT4OrF3GurDN6T8GdACbATsB+wK0OebOwJbAP/ZybG8FHgV2tD3S9svA08BHgNcB+wCnStqsods6wMrAWOCrwNnAHsC7gPcCX5XU6uJ6JfAWSW9v2LYHcH6LfgC7Al8BRgMvA7cCfyjvLwO+A1CC2i+pPq+xwHbAoZK2bxhrxzLnmsCdwLVU/07HUt2OOrNp7j2pPoMxwELgu2WuscBVwDeBtYDDgZ9LWruh72eB/YHVgUfaOM6IiBgAwzJ8NLM92fa9thfbvofqor91U7Njbb9g+0XgAOBbtu+zvRA4AdikndUP4CTbM20/CpwG7Fa2tzPmt0rfF/t4fFfZftCVm4DrqEJFlwXA8bYXABdRXfRPtz3X9lSqWzjvbDHHy8DFVIEDSRsBE4BftVHi5ban2H4JuBx4yfZ5theVMbtWPjYH1rZ9nO1XbD9EFZQ+3TDWzbavLefwUqogd2LDsU2QtEZD+/Nt/9H2C8B/ALtKGlGO42rbV5d/F9cDncCHG/r+2PZU2wvL+K8iaX9JnZI6F82f3cZpiIiIdrwmwoekLSX9uizlzwYOpLoAN5re8Ho8cHpZjp8FzARE9dt1K43jPEL1G3e7Yzb2bZukHSTdVm6hzKK6gDYe33PlQg/QFWyeatj/IjCyjal+AuwuSV
SrApeUUNJK81w9zT0eGNN1jsqxHAO8sZexnu3m2BqPpfnzWIHq3IwHdmma65+BN/XQ9+/YPst2h+2OEauO6q1pRET0wZA/fDlAfgZMAnaw/ZKk0/j78OGG19OpVgp+2o+5xvG3h0HXBR7vw5juZV+3JK0E/Jzq9sIVtheU5zDU17FasX2bpFeoVlV2Lz8DaTrwsO31B3DMcQ2v16VaBXq2zHW+7f166dvnzyMiIpbcayV8rA7MLMFjC6qL5nW9tD8D+Iaku2xPlTQK+KDtS9uY6whJt1P99n0I5XmGJRyzNysCKwHPAAsl7QB8EPjjEo7bk/OogtxC278d4LF/D8yRdBTVsxmvUD2Hs4rtO/o55h6SzgOmUT0TcpntReWh2jvK8yQ3UK2IvBt4wPZjfZ1k47Gj6MzXVyMiBsRr4rYL8AXgOElzqR64vKS3xrYvB04CLpI0h+pCvkObc10BTAHuonqg8ZwBGLO3WucCB1Md0/NUwerKJR23F+cD76C9B037pNw+2RHYBHiYaoXih8CS3NM4H/gx8CTVQ7cHl7mmUz0QfAxVcJsOHMFr5998RMSwJTsrz+2SZGB92w8MdS2DRdIqVN+u2cx2j18JXtZ0dHS4s7NzqMuIiBhWJE2x3dG8Pb8FRrODgDsSPCIiYrC8Vp75GBCS3gtc092+8vdEhtU8fSVpGtWDrDs3bZ9K9e2RZgf086HdiIhYhiV8NLB9M718JdX2gHzDpNU8Q8X2hB62b1RzKRER8RqW2y4RERFRq4SPiIiIqFXCR0RERNQq4SMiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUauEj4iIiKhVwkdERETUKuEjIiIiapXwEREREbVK+IiIiIhaJXxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJWyw91Ae2SNA34vO0bWrQzsL7tB/oxR7/71knSj4HHbH+lzr79IWkC8DCwgu2FAzTmgHxOkiYDF9j+Yau2986YzYSjr1qS6Wox7cSJQ11CRERLWfmIiIiIWiV8xLAhadis1EVERM+GXfiQtIWkWyXNkvSEpEmSVmxq9mFJD0l6VtK3JS3X0H9fSfdJel7StZLG93H+lSSdLOlRSU9JOkPSKmXfNpIek3SYpKdLffs09F1F0imSHpE0W9JvG/p+VNLUclyTJb29od+mkv4gaa6ki4GVm2r6iKS7St9bJL2z3b69HGdvY06TdISkeyS9IOkcSW+UdE2Z5wZJazYNua+kx8s5OaxhrF4/T0mW9K+S7gfu76bOf5Y0XdK25X2Pn6+kD0j6czn3kwC1OAf7S+qU1Llo/ux2TltERLRh2IUPYBHwJWA0sBWwHfCFpjYfAzqAzYCdgH0BJO0MHAN8HFgbuBm4sI/znwRsAGwCrAeMBb7asH8dYFTZ/jng+w0X4pOBdwHvAdYCjgQWS9qg1HFoqetq4JeSViwX4l8A55c+lwKf6JpM0mbAucABwOuBM4ErS0jqtW9PehuzodkngA+Uc7EjcA3VuR1N9e/q4KZhtwXWBz4IHC3p/WV7O5/nzsCWwD821bk91Xn7hO1f9/b5ShoN/Bz4SpnrQeCfejsPts+y3WG7Y8Sqo3prGhERfTDswoftKbZvs73Q9jSqC+PWTc1Osj3T9qPAacBuZfsBwLds31cefjwB2KTd1Q9JAvYDvlTGn1vG+HRDswXAcbYX2L4amAdsWFZf9gUOsT3D9iLbt9h+GfgUcJXt620voAopq1CFlHcDKwCnlTEvA+5omG8/4Ezbt5cxfwK8XPq16tuT3sbs8j3bT9meQXWRv932neV4Lgc2bRrz67ZfsH0v8CPKZ9Lm5/mtcr5fbNi2C3AW8GHbvy/bevt8Pwz8yfZl5RyfBjzZxrmIiIgBNuzuoZdVgu9QrWysSnUMU5qaTW94/QgwprweD5wu6Z
TGIalWKR5pY/q1y5xTqhzy1/4jGto81/StjvnASKrftlem+o272ZjG+W0vljS91LUImGHbTcfUZTywl6R/a9i2YhnTLfr2pLcxuzzV8PrFbt6PbBqz+TPZGPr1eXY5FDivhJnGunv6fMc0jmPb5RxHRETNhl34AH4A3AnsZnuupEOBTza1GQdMLa/XBR4vr6cDx9v+aT/nfpbqwrpR+Y2/r31fAt4K3N2073HKxRj+usIyDphBFSDGSlJDiFiXv4WYrmM6vnlCSVu36NuTHsdcAuOAPzfU0PWZtPN5mr+3C3COpBm2T2uq++8+X0nrlxq63qvxfSsbjx1FZ77GGhExIIbdbRdgdWAOME/S24CDumlzhKQ1JY0DDgEuLtvPAP5d0kYAkkZJ2qXdiW0vBs4GTpX0hjLG2PLsQTt9zwW+I2mMpBGStirPUVwCTJS0naQVgMOobnPcAtwKLAQOlrS8pI8DWzQMfTZwoKQtVVlN0kRJq7fRtye9jdlf/yFp1XLu9+Fvn0k7n2d3Hqd6PuRgSV3PiPT2+V4FbCTp46q+NXMw1fM5ERFRs+EYPg4HdgfmUl0kL+6mzRVUS/d3UV10zgGwfTnVA6MXSZoD/BHYoY/zHwU8ANxWxrgB2LAPtd9L9dzFzFLLcrb/AuwBfI9qhWRHYEfbr9h+heoByr2B56meD/nvrgFtd1I9ozGp7H+gtKVV3570NuYSuKmM87/AybavK9vb+Tx7qvNRqgBylKTP9/b52n6WarXkROA5qodff7eExxQREf2gVz8OEBHd6ejocGdn51CXERExrEiaYrujeftwXPmIiIiIYSzhoxuq/tjXvG5+PjPUtQ0UScf0cIzXDHVtERHx2jYcv+0y6GxvNNQ1DDbbJ1D9HYyIiIhaZeUjIiIiapXwEREREbVK+IiIiIhaJXxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJWCR8RERFRq4SPiIiIqFXCR0RERNQq4SMiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUavlh7qAGN4kHQusZ3uPHvZ/BtjL9gcHaX4D69t+YDDnvnfGbCYcfdWSDjPopp04cahLiIhoKSsfMWAkTZBkSX8NtbZ/OljBo5WhnDsiInqW8BERERG1SvhYRkmaJukISfdIekHSOZLeKOkaSXMl3SBpTUnbSHqsm77v72bY35T/nSVpnqStJO0t6bdt1LORpOslzZT0lKRjyvYtJN0qaZakJyRNkrRiU/cPS3pI0rOSvi1pudL3VXOXVZkDJd0v6XlJ35ekPp24iIhYYgkfy7ZPAB8ANgB2BK4BjgFGU/3bOLiP472v/O8atkfavrWdTpJWB24A/gcYA6wH/G/ZvQj4UqlpK2A74AtNQ3wM6AA2A3YC9u1luo8AmwP/D9gV2L6XuvaX1Cmpc9H82e0cSkREtCHhY9n2PdtP2Z4B3AzcbvtO2y8DlwOb1lTHR4AnbZ9i+yXbc23fDmB7iu3bbC+0PQ04E9i6qf9JtmfafhQ4Dditl7lOtD2rtP01sElPDW2fZbvDdseIVUctyfFFRESDfNtl2fZUw+sXu3k/sqY6xgEPdrdD0gbAd6hWNlal+jc7panZ9IbXj1CtnvTkyYbX86nvGCMiokj4iFZeoLroAyBpBLB2D23dzzmm0/NqxQ+AO4HdbM+VdCjwyaY244Cp5fW6wOP9rKNHG48dRWe+xhoRMSBy2yVa+T9gZUkTJa0AfAVYqYe2zwCLgX/o4xy/AtaRdKiklSStLmnLsm91YA4wT9LbgIO66X9EeTh2HHAIcHEf54+IiBolfESvbM+mesDzh8AMqpWQx3poOx84Hvhd+XbKu9ucYy7Vg687Ut0WuR/Ytuw+HNgdmAucTffB4gqqWzF3AVcB57Qzb0REDA3Z/V0pj1h2dHR0uLOzc6jLiI
gYViRNsd3RvD0rHxEREVGrPHAatZD0Xqq/I/J3bOcbJxERy5CEj6iF7ZvJ11ojIoLcdomIiIiaJXxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJWCR8RERFRq4SPiIiIqFXCR0RERNQq4SMiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUauEj4iIiKhVwkdERETUavmhLiBiOLh3xmwmHH3VUJfRo2knThzqEiIi2paVj3jNk3SspAuGuo6IiKgkfEREREStEj7iNUVSbiVGRCzlEj4GkaRpkg6XdI+k2ZIulrSypDUl/UrSM5KeL6/f3NBvsqRvSrpF0jxJv5T0ekk/lTRH0h2SJjS0f5uk6yXNlPQXSbu2UduPJX1f0lWS5kq6XdJby74Jktx4IS81fb683lvS7ySdKmmWpIckvadsny7paUl7tZj/LaXvcuX9DyU93bD/AkmHltdjJF1Zju8BSfs1tDtW0mWl/Rxg7zL2TeW4rgdGN7RfubR9rsx/h6Q39lDj/pI6JXUumj+71SmNiIg2JXwMvl2BDwFvAd4J7E113n8EjAfWBV4EJjX1+zTwWWAs8Fbg1tJnLeA+4GsAklYDrgd+BrwB2A34L0kbtVHbbsDXgTWBB4Dj+3BcWwL3AK8vc18EbA6sB+wBTJI0sqfOth8G5gCblk3vBeZJent5/z7gpvL6QuAxYAzwSeAESds1DLcTcBmwBvDTUs8UqtDxDaAxCO0FjALGldoPpDr/3dV4lu0O2x0jVh3V27mIiIg+SPgYfN+1/bjtmcAvgU1sP2f757bn255LddHfuqnfj2w/aHs2cA3woO0bbC8ELuVvF+2PANNs/8j2Qtt/AH5OdZFu5b9t/76M+VNgkz4c18NlzkXAxVQX8+Nsv2z7OuAVqiDSm5uArSWtU95fVt6/BXgdcLekccA/A0fZfsn2XcAPqYJZl1tt/8L2YmBtqhD0H6WW31Cd9y4LqELHerYX2Z5ie04fjjsiIpZQ7o8PvicbXs8HxkhaFTiVakVkzbJvdUkjysUc4KmGfi92875rVWE8sKWkWQ37lwfO70dtPa5UdKO5Hmz3VGNPbgI+SrWq8RtgMlWoeAm42fZiSWOAmSWkdXkE6Gh4P73h9RjgedsvNLUfV16fX15fJGkN4ALgy7YXtKg1IiIGSMLH0DgM2BDY0vaTkjYB7gTUj7GmAzfZ/sAA1td14V6V6tYIwDo9tF0SNwHfpgofNwG/Bc6gCh9dt1weB9aStHpDAFkXmNEwjhtePwGsKWm1hgCyblebEjK+Dny9PDdzNfAX4JzeCt147Cg687c0IiIGRG67DI3VqVYGZklai/L8Rj/9CthA0mclrVB+Nm94dqLPbD9DdXHfQ9IISftSPXcyoGzfT3Ue9gB+U25/PAV8ghI+bE8HbgG+VR4WfSfwOarbRN2N+QjQSRUuVpT0z8COXfslbStpY0kjqILVAmBRd2NFRMTgSPgYGqcBqwDPArcB/9PfgcpqwAepHlB9nOpWyknASktY437AEcBzwEZUAWAw3AQ8Z/vRhveiWgnqshswger4Lge+Zvv6XsbcneqB2JlUwe68hn3rUD1bMofqwd2bqG69RERETWS7dauIZVxHR4c7OzuHuoyIiGFF0hTbHc3bs/IRERERtUr4eA2TNLX8kbLmn88sSzVERMTSJd92eQ2z3c4fGnvN1xAREUuXrHxERERErRI+IiIiolYJHxEREVGrhI+IiIioVcJHRERE1CrhIyIiImqV8BERERG1SviIiIiIWiV8RERERK0SPiIiIqJWCR8RERFRq4SPiIiIqFXCR0RERNQq4SMiIiJqlfARERERtUr4iIiIiFolfEREREStEj4iIiKiVgkfERERUauEj4iIiKhVwkdERETUKuEjhhVJG0q6U9JcSQcPdT0REdF3yw
91ARF9dCQw2famdU5674zZTDj6qjqnBGDaiRNrnzMiYrBl5SOGm/HA1L52kpSgHRGxlEj4iGFD0o3AtsAkSfMkHVJuwcyRNF3SsQ1tJ0iypM9JehS4sWzfV9J9kp6XdK2k8UNzNBERy66Ejxg2bP8LcDPwRdsjgbuBPYE1gInAQZJ2buq2NfB2YPuy7xjg48DaZawLayo/IiKKhI8YtmxPtn2v7cW276EKEls3NTvW9gu2XwQOAL5l+z7bC4ETgE16Wv2QtL+kTkmdi+bPHtRjiYhYliR8xLAlaUtJv5b0jKTZwIHA6KZm0xtejwdOlzRL0ixgJiBgbHfj2z7LdoftjhGrjhqMQ4iIWCYlfMRw9jPgSmCc7VHAGVRhopEbXk8HDrC9RsPPKrZvqaneiIggX7WN4W11YKbtlyRtAewOXNdL+zOAb0i6y/ZUSaOAD9q+tNVEG48dRWe+9hoRMSCy8hHD2ReA4yTNBb4KXNJbY9uXAycBF0maA/wR2GHQq4yIiFeR7datIpZxHR0d7uzsHOoyIiKGFUlTbHc0b8/KR0RERNQq4SP+//buP8iusr7j+PsjsQgJCQKR4YcEC0grUDLtKtNarSM6ELX+aIulpINSlTodp7XWGbEzVp0RTa1YsXWKULFYi/xQcYqCONOKVkYdFrUqICKayI8EQyAhCZUKfPvHPTu9LLub3eTuc/cm79fMmbl7znOe83zvk10+POfcXUmSmjJ8SJKkpgwfkiSpKcOHJElqyvAhSZKaMnxIkqSmDB+SJKkpw4ckSWrK8CFJkpoyfEiSpKYMH5IkqSnDhyRJasrwIUmSmjJ8SJKkpgwfkiSpKcOHJElqyvAhSZKaMnxIkqSmDB+SJKkpw4ckSWrK8CFJkpoyfEiSpKYMH5IkqSnDhyRJasrwIUmSmlo07AFIo+B7d2/hyHO+0Ox6a9e8tNm1JKk1Vz4kSVJThg9JktSU4UMjKck5Se5IsjXJLUle1e3fK8l5Se5L8pMkb0pSSRZ1x5cl+ViS9UnuTvKeJHsNtxpJ2rP4zIdG1R3A84ANwGnAJ5McDbwCWAWsBLYDV0467xLgXuBoYDHweeBO4KOTL5DkbOBsgL2WLp+XIiRpT+TKh0ZSVV1ZVfdU1WNVdTlwO/Ac4NXA+VV1V1U9AKyZOCfJwfSCyZurantV/Qz4e+D0aa5xYVWNVdXYXvsum/eaJGlP4cqHRlKSM4G3AEd2u5YABwGH0lvJmND/egXwZGB9kol9T5rURpI0zwwfGjlJVgAXAScDX6+qR5N8BwiwHji8r/nT+17fCTwMHFRVj7QaryTp8QwfGkWLgQI2AiQ5Czi+O3YF8BdJvkDvmY+3TZxUVeuTfAk4L8k7gG3AM4DDq+orM13whMOWMe7v3pCkgfCZD42cqroFOA/4Or2HR08AbugOXwR8Cfgu8G3gGuAR4NHu+JnALwG3AA8AnwYOaTV2SRKkqoY9BmneJFkFXFBVK3aln7GxsRofHx/QqCRpz5Dkpqoam7zflQ/tVpLsk+QlSRYlOQx4J3DVsMclSfp/hg/tbgK8m94tlW8DtwJ/M9QRSZIexwdOtVupqoeAZw97HJKk6bnyIUmSmjJ8SJKkpgwfkiSpKcOHJElqyvAhSZKaMnxIkqSmDB+SJKkpw4ckSWrK8CFJkpoyfEiSpKYMH5IkqSnDhyRJasrwIUmSmjJ8SJKkpgwfkiSpKcOHJElqyvAhSZKaMnxIkqSmDB+SJKkpw4ckSWrK8CFJkpoyfEiSpKYMH5IkqSnDhwYmyc1JXjAP/f5LkvcMul9J0nAsGvYAtPuoquOGPQZJ0sLnyockSWrK8KGBSbI2yYuSPCfJeJIHk9yb5IOzOPfKJBuSbEny1STTrqIkeUOSHyW5P8m/Jzm071gleWOS25M8kOQjSdJ3/E+S3Noduy7Jihmuc3ZXx/jGjRvn8lZIkmZg+NB8OB84v6qWAkcBV8zinGuBY4CnAd8C/m2qRkleCLwPeD
VwCLAOuGxSs5cBzwZO7Nqd0p37SuCvgd8DlgP/BXxqugFV1YVVNVZVY8uXL59FCZKk2TB8aD78Ajg6yUFVta2qvrGjE6rq4qraWlUPA+8CTkyybIqmq4GLq+pbXdu3A7+Z5Mi+NmuqanNV/RT4MrCy2/+nwPuq6taqegR4L7ByptUPSdLgGT40H14HPBP4QZIbk7xspsZJ9kqyJskdSR4E1naHDpqi+aH0VjsAqKptwCbgsL42G/pePwQs6V6vAM5PsjnJZuB+IJPOlSTNMz/tooGrqtuBP0ryJHq3OD6d5MCq2j7NKWcArwBeRC94LAMeoBcMJruHXogAIMli4EDg7lkM7U7g3Kqa8paOJKkNVz40cEn+OMnyqnoM2NztfnSGU/YDHqa3grEvvdsh07kUOCvJyiR7d22/WVVrZzG0C4C3TzzMmmRZktNmcZ4kaYAMH5oPpwI3J9lG7+HT06vq5zO0/wS9Wyl3A7cA0z4jUlX/AbwD+Aywnt4DrafPZlBVdRXwt8Bl3e2d7wOrZnOuJGlwUlXDHoO04I2NjdX4+PiwhyFJIyXJTVU1Nnm/Kx+SJKkpw4eaSLI6ybYptpuHPTZJUlt+2kVNdJ8w8VMmkiRXPiRJUluGD0mS1JThQ5IkNWX4kCRJTRk+JElSU4YPSZLUlOFDkiQ1ZfiQJElNGT4kSVJThg9JktSU4UOSJDVl+JAkSU0ZPiRJUlOGD0mS1JThQ5IkNWX4kCRJTRk+JElSU4YPSZLUlOFDkiQ1ZfiQJElNGT4kSVJThg9JktSU4UOSJDVl+NCMkqxN8qJhj2M6SV6b5GszHL82yWtajkmSNLNFwx6ANJ+qatWwxyBJejxXPiRJUlOGD83GyiTfTbIlyeVJnpLkqUk+n2Rjkge614dPnNDdDvlxkq1JfpJk9Y4ukuQNSW7tzrklya93+89Jckff/lc98dT8Qze+HyQ5ue/A9Ule3zemryX5QDfmnySZdmUkydlJxpOMb9y4cc5vmiRpaoYPzcargVOBZwC/BryW3r+djwMrgCOA/wH+ESDJYuDDwKqq2g/4LeA7M10gyWnAu4AzgaXAy4FN3eE7gOcBy4B3A59Mckjf6ScBPwYOAt4JfDbJAdNc6iTgtq7t+4GPJclUDavqwqoaq6qx5cuXzzR8SdIcGD40Gx+uqnuq6n7gamBlVW2qqs9U1UNVtRU4F/idvnMeA45Psk9Vra+qm3dwjdcD76+qG6vnR1W1DqCqruyu/1hVXQ7cDjyn79yfAR+qql90x28DXjrNddZV1UVV9ShwCXAIcPDc3g5J0q4wfGg2NvS9fghYkmTfJB9Nsi7Jg8BXgf2T7FVV24E/BN4IrE/yhSS/soNrPJ3eCscTJDkzyXeSbE6yGTie3srFhLurqvq+XgccuqNaquqh7uWSHYxNkjRAhg/trL8CjgVOqqqlwPO7/QGoquuq6sX0VhZ+AFy0g/7uBI6avDPJiu7cNwEHVtX+wPcnrtM5bNKtkyOAe+ZckSSpCcOHdtZ+9J7z2Nw9X/HOiQNJDk7y8u7Zj4eBbcCjO+jvn4G3JvmN9BzdBY/FQAEbu77Porfy0e9pwJ8neXL37MivAtfseomSpPlg+NDO+hCwD3Af8A3gi33HnkRvZeQe4H56z4L82UydVdWV9J4buRTYCnwOOKCqbgHOA74O3AucANww6fRvAsd0YzkX+IOq2oQkaUHK42+VS5rK2NhYjY+PD3sYkjRSktxUVWOT97vyIUmSmjJ8qJkkFyTZNsV2wbDHJklqx7/tomaq6o30Pn4rSdqDufIhSZKaMnxIkqSmDB+SJKkpw4ckSWrK8CFJkpoyfEiSpKYMH5IkqSnDhyRJasrwIUmSmjJ8SJKkpgwfkiSpKcOHJElqyvAhSZKaMnxIkqSmDB+SJKmpVNWwxyAteEm2ArcNexwDchBw37AHMSC7Sy27Sx1gLQvVsGpZUVXLJ+9cNISBSKPotqoaG/YgBiHJuLUsLLtLHW
AtC9VCq8XbLpIkqSnDhyRJasrwIc3OhcMewABZy8Kzu9QB1rJQLahafOBUkiQ15cqHJElqmXAgjQAAA59JREFUyvAhSZKaMnxoj5TkgCRXJdmeZF2SM2Zo+5dJNiTZkuTiJHvvTD/zZYC1XJ/k50m2dVvz32sy21qSHJ/kuiT3JXnCveNRmpdZ1DJK8/KaJDcleTDJXUnen2TRXPuZTwOsZZTm5fQkt3Xf9z9LckmSpXPtZ5AMH9pTfQT4X+BgYDXwT0mOm9woySnAOcDJwJHALwPvnms/82xQtQC8qaqWdNux8zrqqc32/fwFcAXwul3sZz4NqhYYnXnZF3gzvV9odRK9f2tv3Yl+5tOgaoHRmZcbgOdW1TJ63/eLgPfsRD+DU1VubnvUBizuvtGe2bfvX4E1U7S9FHhv39cnAxvm2s9Cr6X7+nrg9aMwL33Hj+79GNu1fhZqLaM6L33t3gJcPcrzMlUtozwvwBLgE8A1w5wXVz60J3om8GhV/bBv338DUyX947pj/e0OTnLgHPuZL4OqZcL7uuX/G5K8YOCjndmg3s9Rm5fZGNV5eT5w8wD6GZRB1TJhZOYlyW8n2QJsBX4f+NDO9DMohg/tiZYAWybt2wLsN4u2E6/3m2M/82VQtQC8jd6S7GH0fifA1UmOGtxQd2hQ7+eozcuOjOS8JDkLGAM+sCv9DNigaoERm5eq+lr1brscDvwdsHZn+hkUw4f2RNuApZP2LaX3fwQ7ajvxeusc+5kvg6qFqvpmVW2tqoer6hJ694lfMuDxzmRQ7+eozcuMRnFekrwSWAOsqqqJP2Y2kvMyTS0jOS8AVXU38EXgsl3pZ1cZPrQn+iGwKMkxfftO5IlLqnT7TpzU7t6q2jTHfubLoGqZSgEZyChnZ1Dv56jNy1wt6HlJcipwEfC7VfW9ne1nngyqlqks6HmZZBEwsUoznHkZ1sMybm7D3Oil/k/Re9jqufSWGY+bot2pwAbgWcBTgf+k70Gs2faz0GsB9gdOAZ5C7wfTamA7cOwCrSXdWJ9F74f+U4C9R3Repq1lBOflhcAm4Pm70s9Cr2UE52U1cET3b20F8BXgs8Ocl2ZvkpvbQtqAA4DPdT8wfgqc0e0/gt4y5BF9bd8C3As8CHx80n/kpuxn1GoBlgM30ltq3Qx8A3jxQq2F3keFa9K2dhTnZaZaRnBevgw80u2b2K4d0XmZtpYRnJdzgbu6dnfRe0blwGHOi3/bRZIkNeUzH5IkqSnDhyRJasrwIUmSmjJ8SJKkpgwfkiSpKcOHJElqyvAhSZKaMnxIkqSmDB+SJKmp/wMLIvy9/Qk+TAAAAABJRU5ErkJggg==\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(6, 9))\n",
"\n",
"ind = np.argsort(xgb_model.feature_importances_)[::-1]\n",
"features_sorted = np.array(features)[ind]\n",
"importances_sorted = xgb_model.feature_importances_[ind]\n",
"\n",
"plt.barh(y=range(len(features)), width=importances_sorted, height=0.2)\n",
"plt.title('Gain')\n",
"plt.yticks(ticks=range(len(features)), labels=features_sorted)\n",
"plt.gca().invert_yaxis()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling (part 2): Linear models & Ensembles\n",
"\n",
"Given the randomness of the _Titanic dataset_, we can be satisfied with the performance of the `xgboost` model above. Still, it is always useful to try a variety of models and approaches, especially since `vaex` makes this process rather simple.\n",
"\n",
"In the following part we will use a couple of linear models as our predictors, this time straight from `scikit-learn`. This requires us to pre-process the data in a slightly different way.\n",
"\n",
"### Feature pre-processing for linear models\n",
"\n",
"When using linear models, the safest option is to encode categorical variables with the one-hot encoding scheme, especially if they have low cardinality. We will do this for the \"family_size\" and \"deck\" features. Note that the \"sex\" feature is already encoded, since it has only two unique values.\n",
"\n",
"The \"name_title\" feature is a bit more tricky. Since, in its original form, it has some values that appear only a couple of times, we will use a trick: we will one-hot encode the frequency-encoded values. This reduces the cardinality of the feature, while preserving the most important, i.e. the most common, values.\n",
"\n",
"Regarding the \"age\" and \"fare\" features, to retain more of their variance in the model, we will not convert them to categorical as before, but simply standard-scale them: subtract the mean and divide by the standard deviation. We will do the same for the \"fare_per_family_member\" feature.\n",
"\n",
"\n",
"Finally, we will drop all other features."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:41.979030Z",
"start_time": "2020-05-01T17:12:41.922481Z"
}
},
"outputs": [],
"source": [
"# One-hot encode categorical features\n",
"one_hot = vaex.ml.OneHotEncoder(features=['deck', 'family_size', 'name_title'])\n",
"df_train = one_hot.fit_transform(df_train)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:42.072684Z",
"start_time": "2020-05-01T17:12:41.988593Z"
}
},
"outputs": [],
"source": [
"# Standard scale numerical features\n",
"standard_scaler = vaex.ml.StandardScaler(features=['age', 'fare', 'fare_per_family_member'])\n",
"df_train = standard_scaler.fit_transform(df_train)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:42.088401Z",
"start_time": "2020-05-01T17:12:42.076102Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['deck_A',\n",
" 'deck_B',\n",
" 'deck_C',\n",
" 'deck_D',\n",
" 'deck_E',\n",
" 'deck_F',\n",
" 'deck_G',\n",
" 'deck_M',\n",
" 'family_size_1',\n",
" 'family_size_2',\n",
" 'family_size_3',\n",
" 'family_size_4',\n",
" 'family_size_5',\n",
" 'family_size_6',\n",
" 'family_size_7',\n",
" 'family_size_8',\n",
" 'family_size_11',\n",
" 'standard_scaled_age',\n",
" 'standard_scaled_fare',\n",
" 'standard_scaled_fare_per_family_member',\n",
" 'label_encoded_sex']"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the features for training a linear model\n",
"features_linear = df_train.get_column_names(regex='^deck_|^family_size_|^frequency_encoded_name_title_')\n",
"features_linear += df_train.get_column_names(regex='^standard_scaled_')\n",
"features_linear += ['label_encoded_sex']\n",
"features_linear"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimators: `SVC` and `LogisticRegression`"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:42.170145Z",
"start_time": "2020-05-01T17:12:42.095159Z"
}
},
"outputs": [],
"source": [
"from sklearn.svm import SVC\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:42.646357Z",
"start_time": "2020-05-01T17:12:42.172042Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/jovan/miniconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:231: ConvergenceWarning: Solver terminated early (max_iter=1000). Consider pre-processing your data with StandardScaler or MinMaxScaler.\n",
" % self.max_iter, ConvergenceWarning)\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
#
pclass
survived
name
sex
age
sibsp
parch
ticket
fare
cabin
embarked
boat
body
home_dest
name_title
name_num_words
deck
multi_cabin
has_cabin
family_size
is_alone
age_times_class
fare_per_family_member
label_encoded_sex
label_encoded_embarked
label_encoded_deck
frequency_encoded_name_title
prediction_xgb
deck_A
deck_B
deck_C
deck_D
deck_E
deck_F
deck_G
deck_M
family_size_1
family_size_2
family_size_3
family_size_4
family_size_5
family_size_6
family_size_7
family_size_8
family_size_11
name_title_Capt
name_title_Col
name_title_Countess
name_title_Don
name_title_Dona
name_title_Dr
name_title_Jonkheer
name_title_Lady
name_title_Major
name_title_Master
name_title_Miss
name_title_Mlle
name_title_Mme
name_title_Mr
name_title_Mrs
name_title_Ms
name_title_Rev
standard_scaled_age
standard_scaled_fare
standard_scaled_fare_per_family_member
prediction_svc
prediction_lr
\n",
"\n",
"\n",
"
0
3
False
Stoytcheff, Mr. Ilia
male
19
0
0
349205
7.8958
M
S
None
nan
None
Mr
3
M
0
1
1
0
57
7.8958
0
0
0
0.578797
False
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
-0.807704
-0.493719
-0.342804
False
False
\n",
"
1
1
False
Payne, Mr. Vivian Ponsonby
male
23
0
0
12749
93.5
B24
S
None
nan
Montreal, PQ
Mr
4
B
0
1
1
0
23
93.5
0
0
1
0.578797
False
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
-0.492921
1.19613
1.99718
False
True
\n",
"
2
3
True
Abbott, Mrs. Stanton (Rosa Hunt)
female
35
1
1
C.A. 2673
20.25
M
S
A
nan
East Providence, RI
Mrs
5
M
0
1
3
0
105
6.75
1
0
0
0.145177
True
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0.45143
-0.249845
-0.374124
True
True
\n",
"
3
2
True
Hocking, Miss. Ellen \"Nellie\"
female
20
2
1
29105
23
M
S
4
nan
Cornwall / Akron, OH
Miss
4
M
0
1
4
0
40
5.75
1
0
0
0.201528
True
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
-0.729008
-0.195559
-0.401459
True
True
\n",
"
4
3
False
Nilsson, Mr. August Ferdinand
male
21
0
0
350410
7.8542
M
S
None
nan
None
Mr
4
M
0
1
1
0
63
7.8542
0
0
0
0.578797
False
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
-0.650312
-0.494541
-0.343941
False
False
\n",
"\n",
"
"
],
"text/plain": [
" # pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest name_title name_num_words deck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title prediction_xgb deck_A deck_B deck_C deck_D deck_E deck_F deck_G deck_M family_size_1 family_size_2 family_size_3 family_size_4 family_size_5 family_size_6 family_size_7 family_size_8 family_size_11 name_title_Capt name_title_Col name_title_Countess name_title_Don name_title_Dona name_title_Dr name_title_Jonkheer name_title_Lady name_title_Major name_title_Master name_title_Miss name_title_Mlle name_title_Mme name_title_Mr name_title_Mrs name_title_Ms name_title_Rev standard_scaled_age standard_scaled_fare standard_scaled_fare_per_family_member prediction_svc prediction_lr\n",
" 0 3 False Stoytcheff, Mr. Ilia male 19 0 0 349205 7.8958 M S None nan None Mr 3 M 0 1 1 0 57 7.8958 0 0 0 0.578797 False 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.807704 -0.493719 -0.342804 False False\n",
" 1 1 False Payne, Mr. Vivian Ponsonby male 23 0 0 12749 93.5 B24 S None nan Montreal, PQ Mr 4 B 0 1 1 0 23 93.5 0 0 1 0.578797 False 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.492921 1.19613 1.99718 False True\n",
" 2 3 True Abbott, Mrs. Stanton (Rosa Hunt) female 35 1 1 C.A. 2673 20.25 M S A nan East Providence, RI Mrs 5 M 0 1 3 0 105 6.75 1 0 0 0.145177 True 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0.45143 -0.249845 -0.374124 True True\n",
" 3 2 True Hocking, Miss. Ellen \"Nellie\" female 20 2 1 29105 23 M S 4 nan Cornwall / Akron, OH Miss 4 M 0 1 4 0 40 5.75 1 0 0 0.201528 True 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 -0.729008 -0.195559 -0.401459 True True\n",
" 4 3 False Nilsson, Mr. August Ferdinand male 21 0 0 350410 7.8542 M S None nan None Mr 4 M 0 1 1 0 63 7.8542 0 0 0 0.578797 False 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.650312 -0.494541 -0.343941 False False"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The Support Vector Classifier\n",
"vaex_svc = vaex.ml.sklearn.Predictor(features=features_linear, \n",
" target='survived',\n",
" model=SVC(max_iter=1000, random_state=42),\n",
" prediction_name='prediction_svc')\n",
"\n",
"# Logistic Regression\n",
"vaex_logistic = vaex.ml.sklearn.Predictor(features=features_linear, \n",
" target='survived',\n",
" model=LogisticRegression(max_iter=1000, random_state=42),\n",
" prediction_name='prediction_lr')\n",
"\n",
"# Train the new models and apply the transformation to the train dataframe\n",
"for model in [vaex_svc, vaex_logistic]:\n",
" model.fit(df_train)\n",
" df_train = model.transform(df_train)\n",
" \n",
"# Preview of the train DataFrame\n",
"df_train.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ensemble\n",
"\n",
"Just as before, the predictions from the `SVC` and the `LogisticRegression` classifiers are added as virtual columns in the training dataset. This is quite powerful, since now we can easily use them to create an ensemble! For example, let's do a weighted mean."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:42.958447Z",
"start_time": "2020-05-01T17:12:42.653715Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
#
prediction_xgb
prediction_svc
prediction_lr
prediction_final
\n",
"\n",
"\n",
"
0
False
False
False
False
\n",
"
1
False
False
True
False
\n",
"
2
True
True
True
True
\n",
"
3
True
True
True
True
\n",
"
4
False
False
False
False
\n",
"
...
...
...
...
...
\n",
"
1,042
False
False
False
False
\n",
"
1,043
False
True
True
True
\n",
"
1,044
True
True
False
True
\n",
"
1,045
False
True
True
True
\n",
"
1,046
False
False
False
False
\n",
"\n",
"
"
],
"text/plain": [
"# prediction_xgb prediction_svc prediction_lr prediction_final\n",
"0 False False False False\n",
"1 False False True False\n",
"2 True True True True\n",
"3 True True True True\n",
"4 False False False False\n",
"... ... ... ... ...\n",
"1,042 False False False False\n",
"1,043 False True True True\n",
"1,044 True True False True\n",
"1,045 False True True True\n",
"1,046 False False False False"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Weighted mean of the classes (the weights sum to 1)\n",
"prediction_final = (df_train.prediction_xgb.astype('int') * 0.3 +\n",
" df_train.prediction_svc.astype('int') * 0.5 +\n",
" df_train.prediction_lr.astype('int') * 0.2)\n",
"# Get the predicted class\n",
"prediction_final = (prediction_final >= 0.5)\n",
"# Add the expression to the train DataFrame\n",
"df_train['prediction_final'] = prediction_final\n",
"\n",
"# Preview\n",
"df_train[df_train.get_column_names(regex='^predict')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Performance (part 2)\n",
"\n",
"Applying the ensemble to the test set is just as easy as before. We just need to get the new state of the training DataFrame and transfer it to the test DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:43.334411Z",
"start_time": "2020-05-01T17:12:42.961373Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
#
pclass
survived
name
sex
age
sibsp
parch
ticket
fare
cabin
embarked
boat
body
home_dest
name_title
name_num_words
deck
multi_cabin
has_cabin
family_size
is_alone
age_times_class
fare_per_family_member
label_encoded_sex
label_encoded_embarked
label_encoded_deck
frequency_encoded_name_title
prediction_xgb
deck_A
deck_B
deck_C
deck_D
deck_E
deck_F
deck_G
deck_M
family_size_1
family_size_2
family_size_3
family_size_4
family_size_5
family_size_6
family_size_7
family_size_8
family_size_11
name_title_Capt
name_title_Col
name_title_Countess
name_title_Don
name_title_Dona
name_title_Dr
name_title_Jonkheer
name_title_Lady
name_title_Major
name_title_Master
name_title_Miss
name_title_Mlle
name_title_Mme
name_title_Mr
name_title_Mrs
name_title_Ms
name_title_Rev
standard_scaled_age
standard_scaled_fare
standard_scaled_fare_per_family_member
prediction_svc
prediction_lr
prediction_final
\n",
"\n",
"\n",
"
0
3
False
O'Connor, Mr. Patrick
male
28.032
0
0
366713
7.75
M
Q
None
nan
None
Mr
3
M
0
1
1
0
84.096
7.75
0
1
0
0.578797
False
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
-0.096924
-0.496597
-0.346789
False
False
False
\n",
"
1
3
False
Canavan, Mr. Patrick
male
21
0
0
364858
7.75
M
Q
None
nan
Ireland Philadelphia, PA
Mr
3
M
0
1
1
0
63
7.75
0
1
0
0.578797
False
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
-0.650312
-0.496597
-0.346789
False
False
False
\n",
"
2
1
False
Ovies y Rodriguez, Mr. Servando
male
28.5
0
0
PC 17562
27.7208
D43
C
None
189
?Havana, Cuba
Mr
5
D
0
1
1
0
28.5
27.7208
0
2
4
0.578797
True
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
-0.0600935
-0.102369
0.19911
False
False
True
\n",
"
3
3
False
Windelov, Mr. Einar
male
21
0
0
SOTON/OQ 3101317
7.25
M
S
None
nan
None
Mr
3
M
0
1
1
0
63
7.25
0
0
0
0.578797
False
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
-0.650312
-0.506468
-0.360456
False
False
False
\n",
"
4
2
True
Shelley, Mrs. William (Imanita Parrish Hall)
female
25
0
1
230433
26
M
S
12
nan
Deer Lodge, MT
Mrs
6
M
0
1
2
0
50
13
1
0
0
0.145177
True
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
-0.335529
-0.136338
-0.203281
True
True
True
\n",
"\n",
"
"
],
"text/plain": [
" # pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home_dest name_title name_num_words deck multi_cabin has_cabin family_size is_alone age_times_class fare_per_family_member label_encoded_sex label_encoded_embarked label_encoded_deck frequency_encoded_name_title prediction_xgb deck_A deck_B deck_C deck_D deck_E deck_F deck_G deck_M family_size_1 family_size_2 family_size_3 family_size_4 family_size_5 family_size_6 family_size_7 family_size_8 family_size_11 name_title_Capt name_title_Col name_title_Countess name_title_Don name_title_Dona name_title_Dr name_title_Jonkheer name_title_Lady name_title_Major name_title_Master name_title_Miss name_title_Mlle name_title_Mme name_title_Mr name_title_Mrs name_title_Ms name_title_Rev standard_scaled_age standard_scaled_fare standard_scaled_fare_per_family_member prediction_svc prediction_lr prediction_final\n",
" 0 3 False O'Connor, Mr. Patrick male 28.032 0 0 366713 7.75 M Q None nan None Mr 3 M 0 1 1 0 84.096 7.75 0 1 0 0.578797 False 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.096924 -0.496597 -0.346789 False False False\n",
" 1 3 False Canavan, Mr. Patrick male 21 0 0 364858 7.75 M Q None nan Ireland Philadelphia, PA Mr 3 M 0 1 1 0 63 7.75 0 1 0 0.578797 False 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.650312 -0.496597 -0.346789 False False False\n",
" 2 1 False Ovies y Rodriguez, Mr. Servando male 28.5 0 0 PC 17562 27.7208 D43 C None 189 ?Havana, Cuba Mr 5 D 0 1 1 0 28.5 27.7208 0 2 4 0.578797 True 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.0600935 -0.102369 0.19911 False False True\n",
" 3 3 False Windelov, Mr. Einar male 21 0 0 SOTON/OQ 3101317 7.25 M S None nan None Mr 3 M 0 1 1 0 63 7.25 0 0 0 0.578797 False 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.650312 -0.506468 -0.360456 False False False\n",
" 4 2 True Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 1 230433 26 M S 12 nan Deer Lodge, MT Mrs 6 M 0 1 2 0 50 13 1 0 0 0.145177 True 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 -0.335529 -0.136338 -0.203281 True True True"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# State transfer: apply all transformations defined on df_train to df_test\n",
"state_new = df_train.state_get()\n",
"df_test.state_set(state_new)\n",
"\n",
"# Preview\n",
"df_test.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's check the performance of each individual model, as well as that of the ensembler, on the test set."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-01T17:12:43.490196Z",
"start_time": "2020-05-01T17:12:43.337368Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"prediction_xgb\n",
"Accuracy: 0.798\n",
"f1 score: 0.744\n",
"roc-auc: 0.785\n",
" \n",
"prediction_svc\n",
"Accuracy: 0.802\n",
"f1 score: 0.743\n",
"roc-auc: 0.786\n",
" \n",
"prediction_lr\n",
"Accuracy: 0.779\n",
"f1 score: 0.713\n",
"roc-auc: 0.762\n",
" \n",
"prediction_final\n",
"Accuracy: 0.821\n",
"f1 score: 0.785\n",
"roc-auc: 0.817\n",
" \n"
]
}
],
"source": [
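"# Evaluate every prediction column (individual models and the ensembler) on the test set\n",
"pred_columns = df_train.get_column_names(regex='^prediction_')\n",
"for column in pred_columns:\n",
"    print(column)\n",
"    binary_metrics(y_true=df_test.survived.values, y_pred=df_test[column].values)\n",
"    print(' ')"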
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that our ensembler performs better than any individual model on all three metrics, as expected.\n",
"\n",
"Thank you for going through this example. Feel free to copy, modify, and in general play around with this notebook."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}