Jupyter Notebook: Forget CSV, Fetch Data With Python
In this post, see how to call the Oracle DB from Jupyter notebook with Python code.
If you read a book, article, or blog about machine learning, chances are it will use training data from a CSV file. There's nothing wrong with CSV, but let's consider whether it is really practical. Wouldn't it be better to read data directly from the DB? Often, you can't feed business data directly into ML training because it needs pre-processing: changing categorical data, calculating new data features, and so on. Data preparation/transformation steps can be done quite easily with SQL while fetching the original business data. Another advantage of reading data directly from the DB is that when the data changes, it is easier to automate the ML model re-training process.
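To illustrate the point about pre-processing in SQL, a categorical encoding or a derived feature can be folded into the fetch query itself. This is a sketch against a hypothetical CUSTOMERS table; the table and column names are assumptions, not part of this post's dataset:

```sql
-- Hypothetical example: encode a categorical column and derive a new
-- feature while fetching, so the result set arrives already ML-ready.
select case when GENDER = 'F' then 1 else 0 end as GENDER_ENCODED,
       ANNUAL_INCOME / 12 as MONTHLY_INCOME,
       AGE
  from CUSTOMERS
```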
In this post, I describe how to call the Oracle DB from Jupyter notebook with Python code.
Step 1
Install cx_Oracle Python module:
python -m pip install cx_Oracle
This module helps to connect to the Oracle DB from Python.
Step 2
cx_Oracle enables us to execute SQL calls from Python code. However, to call a remote DB from a Python script, we need to install and configure Oracle Instant Client on the machine where Python runs.
If you are using Ubuntu, install alien:
sudo apt-get update
sudo apt-get install alien
Download RPM files for Oracle Instant Client and install with alien:
alien -i oracle-instantclient18.3-basiclite-18.3.0.0.0-1.x86_64.rpm
alien -i oracle-instantclient18.3-sqlplus-18.3.0.0.0-1.x86_64.rpm
alien -i oracle-instantclient18.3-devel-18.3.0.0.0-1.x86_64.rpm
Add environment variables:
export ORACLE_HOME=/usr/lib/oracle/18.3/client64
export PATH=$PATH:$ORACLE_HOME/bin
Read more here.
Step 3
Install Magic SQL Python modules:
pip install jupyter-sql
pip install ipython-sql
Installation and configuration complete.
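With everything installed, ipython-sql expects a SQLAlchemy-style connection string. As a minimal sketch (the host, port, and service name are the placeholders used later in this post, and the credentials are made up), it can be assembled like this:

```python
# Build the oracle+cx_oracle connection URL consumed by the %sql magic.
username = "hr"           # placeholder credentials
password = "hr_password"  # placeholder credentials
host, port = "dbhost", 1521
service_name = "ORCLPDB1.localdomain"

url = f"oracle+cx_oracle://{username}:{password}@{host}:{port}/?service_name={service_name}"
print(url)
```
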
For today's sample, I'm using the Pima Indians Diabetes Database. The CSV data can be downloaded from here. I uploaded the CSV data into a database table and will fetch it through SQL directly in the Jupyter notebook.
First, a connection is established to the DB, and then the SQL query is executed. The query result set is stored in a variable called result. Notice the %%sql statement: this is magic SQL.
import json
import cx_Oracle

%load_ext sql

with open('credentials.json') as f:
    data = json.load(f)
    username = data['username']
    password = data['password']

%sql oracle+cx_oracle://$username:$password@dbhost:1521/?service_name=ORCLPDB1.localdomain

Output:

'Connected: hr@'

%%sql result <<
select TIMES_PREGNANT "TIMES_PREGNANT", GLUCOSE "GLUCOSE", BLOOD_PRESSURE "BLOOD_PRESSURE",
       SKIN_FOLD_THICK "SKIN_FOLD_THICK", SERUM_INSULIN "SERUM_INSULIN",
       MASS_INDEX "MASS_INDEX", DIABETES_PEDIGREE "DIABETES_PEDIGREE", AGE "AGE",
       CLASS_VAR "CLASS_VAR" from PIMA_INDIANS_DIABETES

Output:

 * oracle+cx_oracle://hr:***@dbhost:1521/?service_name=ORCLPDB1.localdomain
0 rows affected.
Returning data to local variable result
The username and password must be specified while establishing a connection. To avoid sharing a password, make sure to read the password value from an external source (it could be a simple JSON file, as in this example, or a more advanced encrypted token from a keyring).
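A minimal sketch of that pattern: keep the secret in a file outside version control and read it at connect time. The file name and keys here mirror the example above; the credential values are made up, and the temporary file only exists so the sketch is self-contained:

```python
import json
import os
import tempfile

# Create a sample credentials file. In practice this file already exists
# and is excluded from version control (e.g. via .gitignore); a keyring
# lookup could replace the file read entirely.
creds_path = os.path.join(tempfile.mkdtemp(), "credentials.json")
with open(creds_path, "w") as f:
    json.dump({"username": "hr", "password": "hr_password"}, f)

# Read the credentials back -- this is the part that runs in the notebook.
with open(creds_path) as f:
    data = json.load(f)
    username = data["username"]
    password = data["password"]

print(username)
```
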
The beauty of this approach is that data fetched through a SQL query is available out-of-the-box in a Pandas DataFrame. Machine learning engineers can work with the data in the same way as if it were loaded through CSV:
df = result.DataFrame()
df.head()

Output:

   times_pregnant  glucose  blood_pressure  skin_fold_thick  serum_insulin  mass_index  diabetes_pedigree  age  class_var
0               1      106              70               28            135        34.2              0.142   22          0
1               2      155              52               27            540        38.7              0.240   25          1
2               2      101              58               35             90        21.8              0.155   22          0
3               1      120              80               48            200        38.9              1.162   41          0
4              11      127             106                0              0        39.0              0.190   51          0

print(f'Number of rows in dataset: {df.shape[0]}')
print(df[df.columns[8]].value_counts())

Output:

Number of rows in dataset: 768
0    500
1    268
Name: class_var, dtype: int64

corrs = df.corr()['class_var'].abs()
columns = corrs[corrs > .01].index
corrs = corrs.filter(columns)
corrs

Output:

times_pregnant     0.221898
glucose            0.466581
blood_pressure     0.065068
skin_fold_thick    0.074752
serum_insulin      0.130548
age                0.238356
class_var          1.000000
Name: class_var, dtype: float64
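To make the "same as with CSV" point concrete, here is a tiny standalone illustration using the five head rows shown above (trimmed to three of the nine columns to keep it short; with a real connection, result.DataFrame() would produce the full frame):

```python
import pandas as pd

# The five df.head() rows from the query result, trimmed to three columns.
df = pd.DataFrame({
    "times_pregnant": [1, 2, 2, 1, 11],
    "glucose":        [106, 155, 101, 120, 127],
    "class_var":      [0, 1, 0, 0, 0],
})

# Exactly the operations you would run on CSV-loaded data:
# class balance and feature correlations against the target.
print(df["class_var"].value_counts())
print(df.corr()["class_var"].abs())
```
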
The sample Jupyter notebook is available on GitHub, along with a sample credentials JSON file.
Published at DZone with permission of Andrejus Baranovskis, DZone MVB. See the original article here.