Using Python in a Space
To use Python in a Space you must:
Activate a Space session via a secure desktop
Select the JupyterLab tool icon (on legacy Space specifications this may be an Anaconda icon)
The primary development tool for Python is Jupyter. Current Space specifications utilize the JupyterLab interface, whereas Legacy Space specifications may use the Jupyter notebook interface.
The core capabilities provided by JupyterLab are:
Interactive Python development environment
Inline, browser-based outputs, including visualizations
A common, Space-wide collaboration area for the management of user-defined code, notebooks and artefacts
Code auto-complete assistant
Python environment kernel management
JupyterLab Home Area
Once you have clicked on the JupyterLab icon, the JupyterLab home area will be loaded. The home area has three primary functions:
Provide a persistent storage location for user-defined code, notebooks and artefacts. A file explorer is provided to navigate this writeable storage location, which is shared across all collaborators in a Space and persists across Space sessions.
Provide access to guidance, tutorial and use-case-specific analytical resources supplied by the platform host. These are often organised within structured folders accessible from the home area and offer generic guidance on tool usage, as well as data product and use-case-specific walk-throughs that support the rapid generation of insights and outcomes.
Provide a launcher for the creation of new:
Python notebooks
PySpark notebooks
Text Files
Guidance notebooks provisioned into the JupyterLab home area provide the best and most meaningful walk-through examples of using Python and PySpark. As such, this guide provides only a high-level overview of Python concepts, and it is recommended to utilise these in-platform resources for detailed familiarization.
Python and PySpark
Two kernel types are supported within JupyterLab:
Python: A pure Python environment
PySpark: An enhanced Python environment with a pre-configured PySpark connection to the Space's metastore described in Using SQL in a Space.
PySpark is the recommended kernel type for the vast majority of usage scenarios. Guidance notebooks will most likely have been created using this kernel, which will be selected automatically when the notebooks are opened.
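A quick way to confirm which kernel a notebook is running on is to check for the pre-configured Spark session. This is a minimal sketch, assuming the PySpark kernel exposes the session under the conventional name spark:

# Check whether a pre-configured Spark session is available (PySpark kernel only)
try:
    spark  # defined by the PySpark kernel, not by this cell
    print("PySpark kernel detected, Spark version:", spark.version)
except NameError:
    print("Pure Python kernel - no Spark session available")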
PySpark
Notebooks utilising the PySpark kernel come pre-configured with a Spark environment that facilitates distributed compute approaches for highly performant data exploration, analysis and model development workloads, including the following components:
SparkSQL - Spark-based SQL execution against the Space's metastore
MLlib - A library of machine learning algorithms, pipelines and capabilities optimised for execution on parallelised, cluster compute infrastructure
Detailed walk-throughs on the use of these Spark components are provided via in-platform guidance materials.
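As a brief illustration, the sketch below exercises both components in a PySpark notebook. The table and column names are hypothetical, and the pre-configured session is assumed to be available as spark:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# SparkSQL: query a data product in the metastore (table and column names are hypothetical)
df = spark.sql("SELECT feature_a, feature_b, target FROM example_data_product")

# MLlib: assemble a feature vector and fit a simple regression model on the cluster
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="target").fit(assembler.transform(df))
print(model.coefficients)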
Installing Python Packages
The Python environments are provisioned with a core set of packages. A full listing of these packages can be found by running the code snippet below in any Jupyter cell:
import pkgutil

# Iterate over all importable top-level modules in the current environment
packages = pkgutil.iter_modules()
for package in packages:
    print(package.name)
Additional packages may also be made available via a secured PyPI repository instance, the contents of which are determined by the platform administrator. To install an additional package from this repository, use the Jupyter-native pip method, e.g. within a JupyterLab notebook cell run the command:
pip install <package-name>
If the package you require is not available within the repository, contact your platform administrator, who will be able to arrange the inclusion of the package, assuming that it complies with any repository policies and/or standards.
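Before raising a request it can be useful to confirm whether a package is already provisioned in the environment. This is a minimal sketch using a hypothetical package name:

import importlib.util

# "example_package" is a hypothetical name - substitute the package you need
if importlib.util.find_spec("example_package") is None:
    print("Not installed - try: pip install example_package")
else:
    print("Package already available in this environment")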
Performance Optimization (coming soon to all customers)
The default Python compute engine used in Spaces is PySpark, which provides a robust, scalable, in-memory compute execution framework for data science and analytical activities within Spaces.
An alternative query engine called Presto (aka Trino) is also made available, which can provide up to a 10x performance increase when querying data products. The PyHive library is used to access the Presto query engine using the standard configuration code shown below:
from pyhive import presto

# Connect to the Presto query engine (host and port as configured for the Space)
conn = presto.Connection(host='localhost', port=8060)
This connection can then be leveraged in two ways:
1. Create a Pandas dataframe directly from query results using the Pandas read_sql method
2. Utilise a cursor construct to issue remote queries and return results in pages
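The sketch below illustrates both approaches against the connection created above. The query and table name are hypothetical, and the paging example uses the standard DB-API cursor exposed by the connection:

import pandas as pd

query = "SELECT * FROM example_data_product LIMIT 1000"  # hypothetical table name

# Option 1: load the full result set into a Pandas dataframe
df = pd.read_sql(query, conn)

# Option 2: page through results incrementally using a cursor
cursor = conn.cursor()
cursor.execute(query)
while True:
    rows = cursor.fetchmany(100)  # fetch the next page of up to 100 rows
    if not rows:
        break
    print(len(rows))  # process each page of rows here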
References and FAQs