About the Tools in Spaces

Apache Hadoop is a software platform designed for distributed storage and processing of large data sets across computer clusters. Many of the tools used in the Harbr platform are built on top of the Hadoop framework.

Apache Spark is an in-memory, distributed compute engine for general data engineering and advanced analytics. Spark provides the interface to Data Products on the Harbr platform for analytical programming languages such as Python, R and SQL.
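
For example, a minimal PySpark sketch of querying a Data Product from within a Space; the database, table and column names below are illustrative placeholders, not real Harbr objects:

    from pyspark.sql import SparkSession

    # In a Space notebook a Spark session is usually pre-configured;
    # getOrCreate() reuses it or builds one with Hive support enabled.
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Read a Data Product table registered in the Hive metastore
    # (the database and table names are hypothetical).
    df = spark.sql("SELECT * FROM my_data_product.transactions")

    # Mix SQL-style access with the DataFrame API.
    df.groupBy("country").count().show(10)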

Apache Hive is data warehouse software built on top of Hadoop that facilitates querying and analysis of big data in an SQL-based framework. Hive provides the access mechanism to Data Products that allows them to be deployed instantaneously within a Space.
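
As a brief sketch of what that access mechanism looks like in practice, the databases and tables Hive exposes can be listed from any Spark-backed tool in the Space; the Data Product database name below is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Each deployed Data Product appears as a database in the Hive metastore.
    for db in spark.catalog.listDatabases():
        print(db.name)

    # List the tables inside one (hypothetical) Data Product database.
    spark.sql("SHOW TABLES IN my_data_product").show()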

HUE is the in-built SQL development environment that provides a browser-based tool to visualize the databases and interact with Data Products. HUE queries are executed in Hive and written in HiveQL, a SQL dialect that combines standard SQL syntax with advanced big data capabilities.

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
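
As an illustrative sketch of a typical notebook cell in a Space, combining Spark access with pandas and matplotlib; the table and column names are placeholders:

    import matplotlib.pyplot as plt
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Aggregate a (hypothetical) Data Product table and bring the small
    # result set into pandas for local work.
    pdf = spark.sql(
        "SELECT country, COUNT(*) AS orders "
        "FROM my_data_product.orders GROUP BY country"
    ).toPandas()

    # Visualise the result inline in the notebook.
    pdf.plot.bar(x="country", y="orders")
    plt.show()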

The RStudio application makes the programming language R available and comes with a managed copy of CRAN for package management. R interfaces with Data Products via SparkR, with the sparklyr package also available.

Apache Zeppelin is a web-based notebook that enables interactive data analytics. It supports Python as well as a growing list of other interpreters, including Scala, Hive, Spark SQL, shell and Markdown.

Superset is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts.

Trino is a fast distributed SQL query engine for big data analytics that allows you to explore and analyse both Data Products and Remote Data Assets that have been added to a Space. Trino can be leveraged directly in the Spaces SQL development tool or through connections defined in JupyterLab notebooks via the PyHive library.
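
A hedged sketch of a Trino connection from a JupyterLab notebook using PyHive; the host, port, user, catalog and schema values, and the table being queried, are placeholders that would come from your Space configuration:

    from pyhive import trino

    # Connection details are illustrative; use the endpoint provided in your Space.
    conn = trino.connect(
        host="trino.example.internal",
        port=8080,
        username="space_user",
        catalog="hive",
        schema="default",
    )

    cur = conn.cursor()
    cur.execute("SELECT * FROM my_data_product.transactions LIMIT 10")
    for row in cur.fetchall():
        print(row)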