Using Python in a Space
To use Python in a Space you must:
Activate a Space session via a secure desktop
Select the JupyterLab tool icon (on legacy Space specifications this may be an Anaconda icon)
The primary development tool for Python is Jupyter. Current Space specifications utilize the JupyterLab interface, whereas Legacy Space specifications may use the Jupyter notebook interface.
The core capabilities provided by JupyterLab are:
Interactive Python development environment
Inline, browser-based outputs, including visualizations
A common, Space-wide collaboration area for the management of user-defined code, notebooks and artefacts
Code auto-complete assistant
Python environment kernel management
JupyterLab Home Area
Once you have clicked on the JupyterLab icon, the JupyterLab home area will be loaded. The home area has three primary functions:
Provide a persistent storage location for user-defined code, notebooks and artefacts. A file explorer is provided to navigate this writeable storage location, which is shared across all collaborators in a Space and persists across Space sessions.
Provide access to guidance, tutorial and use-case-specific analytical resources supplied by the platform host. These are often organised within structured folders accessible from the home area and offer generic guidance on tool usage, as well as data product and use-case-specific walk-throughs that support the rapid generation of insights and outcomes.
Provide a launcher for the creation of new:
Python notebooks
PySpark notebooks
Text Files
Guidance notebooks provisioned into the JupyterLab home area provide the best and most meaningful walk-through examples of using Python and PySpark. As such, this guide provides only a high-level overview of Python concepts, and it is recommended to utilise these in-platform resources for detailed familiarization.
Python and PySpark
Two kernel types are supported within JupyterLab:
Python: A pure Python environment
PySpark: An enhanced Python environment with a pre-configured PySpark connection to the Space's metastore described in Using SQL in a Space.
PySpark is the recommended kernel type for the vast majority of usage scenarios. Guidance notebooks will most likely have been created using this kernel, which will be selected automatically when the notebooks are opened.
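A quick way to confirm which kernel a notebook is running on is to check for the pre-configured Spark session. This is a minimal sketch, assuming the PySpark kernel exposes the session under the conventional name spark:

# Check whether a pre-configured Spark session is available (PySpark kernel only)
try:
    spark  # defined by the PySpark kernel, not by this cell
    print("PySpark kernel detected, Spark version:", spark.version)
except NameError:
    print("Pure Python kernel - no Spark session available")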
PySpark
Notebooks utilising the PySpark kernel come pre-configured with a Spark environment that facilitates distributed compute approaches for highly performant data exploration, analysis and model development workloads, including the following components:
SparkSQL - Spark-based SQL execution against the Space's metastore
MLlib - A library of machine learning algorithms, pipelines and capabilities optimised for execution on parallelised, cluster compute infrastructure
Detailed walk-throughs on the use of these Spark components are provided via in-platform guidance materials.
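As a brief illustration, the sketch below exercises both components in a PySpark notebook. The table and column names are hypothetical, and the pre-configured session is assumed to be available as spark:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# SparkSQL: query a data product in the metastore (table and column names are hypothetical)
df = spark.sql("SELECT feature_a, feature_b, target FROM example_data_product")

# MLlib: assemble a feature vector and fit a simple regression model on the cluster
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="target").fit(assembler.transform(df))
print(model.coefficients)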
Installing Python Packages
The Python environments are provisioned with a core set of packages. A full listing of these packages can be found by running the code snippet below in any Jupyter cell:
import pkgutil

# Iterate over all importable top-level modules in the current environment
packages = pkgutil.iter_modules()
for package in packages:
    print(package.name)
Additional packages may also be made available via a secured PyPI repository instance, the contents of which are determined by the platform administrator. To install an additional package from this repository, use the Jupyter-native pip method, e.g. within a JupyterLab notebook cell run the command:
pip install <package-name>
If the package you require is not available within the repository, contact your platform administrator, who will be able to arrange the inclusion of the package, assuming that it complies with any repository policies and/or standards.
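Before raising a request it can be useful to confirm whether a package is already provisioned in the environment. This is a minimal sketch using a hypothetical package name:

import importlib.util

# "example_package" is a hypothetical name - substitute the package you need
if importlib.util.find_spec("example_package") is None:
    print("Not installed - try: pip install example_package")
else:
    print("Package already available in this environment")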
Performance Optimization (coming soon to all customers)
The default Python compute engine used in Spaces is PySpark, which provides a robust, scalable, in-memory compute execution framework for data science and analytical activities within Spaces.
An alternative query engine called Presto (aka Trino) is also made available, which can provide up to a 10x performance increase when querying data products. The PyHive library is used to access the Presto query engine using the standard configuration code shown below:
from pyhive import presto

# Connect to the Presto query engine (host and port as configured for the Space)
conn = presto.Connection(host='localhost', port=8060)
This connection can then be leveraged in two ways:
1. Create a Pandas dataframe directly from query results using the Pandas read_sql method
2. Utilise a cursor construct to issue remote queries and return results in pages
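The sketch below illustrates both approaches against the connection created above. The query and table name are hypothetical, and the paging example uses the standard DB-API cursor exposed by the connection:

import pandas as pd

query = "SELECT * FROM example_data_product LIMIT 1000"  # hypothetical table name

# Option 1: load the full result set into a Pandas dataframe
df = pd.read_sql(query, conn)

# Option 2: page through results incrementally using a cursor
cursor = conn.cursor()
cursor.execute(query)
while True:
    rows = cursor.fetchmany(100)  # fetch the next page of up to 100 rows
    if not rows:
        break
    print(len(rows))  # process each page of rows here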
References and FAQs