DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Jupyter Notebook: Forget CSV, Fetch Data With Python

Jupyter Notebook: Forget CSV, Fetch Data With Python

In this post, see how to call the Oracle DB from Jupyter notebook with Python code.

Andrejus Baranovskis user avatar by
Andrejus Baranovskis
·
Mar. 19, 19 · Tutorial
Like (2)
Save
Tweet
Share
9.14K Views

Join the DZone community and get the full member experience.

Join For Free

If you read a book, article, or blog about machine learning, chances are it will use training data from a CSV file. There's nothing wrong with CSV, but let's think about if it is really practical. Wouldn't it be better to read data directly from the DB? Often, you can't feed business data directly into ML training because it needs pre-processing — changing categorial data, calculating new data features, etc. Data preparation/transformation steps can be done quite easily with SQL while fetching original business data. Another advantage of reading data directly from DB is when data changes, it is easier to automate the ML model re-train process.

In this post, I describe how to call the Oracle DB from Jupyter notebook with Python code.

Step 1

Install cx_Oracle Python module:

python -m pip install cx_Oracle

This module helps to connect to the Oracle DB from Python.

Step 2

cx_Oracle enables us to execute SQL call from Python code, but to be able to call remote DB from Python script, we need to install and configure Oracle Instant Client on the machine where Python runs.

If you are using Ubuntu, install alien:

sudo apt-get update 
sudo apt-get install alien

Download RPM files for Oracle Instant Client and install with alien:

alien -i oracle-instantclient18.3-basiclite-18.3.0.0.0-1.x86_64.rpm 
alien -i oracle-instantclient18.3-sqlplus-18.3.0.0.0-1.x86_64.rpm 
alien -i oracle-instantclient18.3-devel-18.3.0.0.0-1.x86_64.rpm

Add environment variables:

export ORACLE_HOME=/usr/lib/oracle/18.3/client64 
export PATH=$PATH:$ORACLE_HOME/bin

Read more here.

Step 3

Install Magic SQL Python modules:

pip install jupyter-sql 
pip install ipython-sql

Installation and configuration complete.

For today's sample, I'm using the Pima Indians Diabetes Database. CSV data can be downloaded from here. I uploaded CSV data into the database table and will be fetching it through SQL directly in Jupyter notebook.

First of all, the connection is established to the DB and then SQL query is executed. The query result set is stored in a variable called result. Do you see %%sql — this magic SQL:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import cx_Oracle\n",
    "\n",
    "%load_ext sql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Connected: hr@'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('credentials.json') as f:\n",
    "    data = json.load(f)\n",
    "    username = data['username']\n",
    "    password = data['password']\n",
    "\n",
    "%sql oracle+cx_oracle://$username:$password@dbhost:1521/?service_name=ORCLPDB1.localdomain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * oracle+cx_oracle://hr:***@dbhost:1521/?service_name=ORCLPDB1.localdomain\n",
      "0 rows affected.\n",
      "Returning data to local variable result\n"
     ]
    }
   ],
   "source": [
    "%%sql result <<\n",
    "select TIMES_PREGNANT \"TIMES_PREGNANT\", GLUCOSE \"GLUCOSE\", BLOOD_PRESSURE \"BLOOD_PRESSURE\",\n",
    "       SKIN_FOLD_THICK \"SKIN_FOLD_THICK\", SERUM_INSULIN \"SERUM_INSULING\",\n",
    "       MASS_INDEX \"MASS_INDEX\", DIABETES_PEDIGREE \"DIABETES_PEDIGREE\", AGE \"AGE\",\n",
    "       CLASS_VAR \"CLASS_VAR\" from PIMA_INDIANS_DIABETES"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

Username and password must be specified while establishing a connection. To avoid sharing a password, make sure to read password value from the external source (it could be simple JSON file as in this example or a more advanced encoded token from keyring).

The beauty of this approach is that data fetched through SQL query is out-of-the-box available in Data Frame. Machine Learning engineers can work with the data in the same way as it would be loaded through CSV:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>times_pregnant</th>\n",
       "      <th>glucose</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>skin_fold_thick</th>\n",
       "      <th>serum_insuling</th>\n",
       "      <th>mass_index</th>\n",
       "      <th>diabetes_pedigree</th>\n",
       "      <th>age</th>\n",
       "      <th>class_var</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>106</td>\n",
       "      <td>70</td>\n",
       "      <td>28</td>\n",
       "      <td>135</td>\n",
       "      <td>34.2</td>\n",
       "      <td>0.142</td>\n",
       "      <td>22</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>155</td>\n",
       "      <td>52</td>\n",
       "      <td>27</td>\n",
       "      <td>540</td>\n",
       "      <td>38.7</td>\n",
       "      <td>0.24</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>101</td>\n",
       "      <td>58</td>\n",
       "      <td>35</td>\n",
       "      <td>90</td>\n",
       "      <td>21.8</td>\n",
       "      <td>0.155</td>\n",
       "      <td>22</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>120</td>\n",
       "      <td>80</td>\n",
       "      <td>48</td>\n",
       "      <td>200</td>\n",
       "      <td>38.9</td>\n",
       "      <td>1.162</td>\n",
       "      <td>41</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11</td>\n",
       "      <td>127</td>\n",
       "      <td>106</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39</td>\n",
       "      <td>0.19</td>\n",
       "      <td>51</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   times_pregnant  glucose  blood_pressure  skin_fold_thick  serum_insuling  \\\n",
       "0               1      106              70               28             135   \n",
       "1               2      155              52               27             540   \n",
       "2               2      101              58               35              90   \n",
       "3               1      120              80               48             200   \n",
       "4              11      127             106                0               0   \n",
       "\n",
       "  mass_index diabetes_pedigree  age  class_var  \n",
       "0       34.2             0.142   22          0  \n",
       "1       38.7              0.24   25          1  \n",
       "2       21.8             0.155   22          0  \n",
       "3       38.9             1.162   41          0  \n",
       "4         39              0.19   51          0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = result.DataFrame()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of rows in dataset: {df.shape[0]}\n",
      "0    500\n",
      "1    268\n",
      "Name: class_var, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print('Number of rows in dataset: {df.shape[0]}')\n",
    "print(df[df.columns[8]].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "times_pregnant     0.221898\n",
       "glucose            0.466581\n",
       "blood_pressure     0.065068\n",
       "skin_fold_thick    0.074752\n",
       "serum_insuling     0.130548\n",
       "age                0.238356\n",
       "class_var          1.000000\n",
       "Name: class_var, dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corrs = df.corr()['class_var'].abs()\n",
    "columns = corrs[corrs > .01].index\n",
    "corrs = corrs.filter(columns)\n",
    "corrs"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

Sample Jupyter notebook available on GitHub. Sample credentials JSON file.

jupyter notebook Data (computing) Python (language) CSV Machine learning Database Fetch (FTP client)

Published at DZone with permission of Andrejus Baranovskis, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Practical Example of Using CSS Layer
  • Using GPT-3 in Our Applications
  • 3 Main Pillars in ReactJS
  • Build an Automated Testing Pipeline With GitLab CI/CD and Selenium Grid

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: