Superset

POC: Raft

Introduction
- 1.1. Purpose
- 1.2. Target Audience
- 1.3. Scope
Overview
- 2.1. Capability Description
- 2.2. Key Features
Available Data Sources
- 3.1. df-delta
- 3.2. df-kafka
- 3.3. df-pinot
- 3.4. df-postgres
- 3.5. df-postgis
- 3.6. df-arcadedb
Setup
- 4.1. Accessing Superset
- 4.2. Logging In
- 4.3. Navigation
  - Dashboards
  - Charts
  - Datasets
  - SQL Lab
  - Settings Menu
- 4.4. Additional Navigation Features
  - Favorites Section
  - Sorting and Filtering Options
  - Recents Section
  - Import Button
  - Bulk Select Button
- 4.5. Adding Data Sources
- 4.6. Connecting Disparate Data Sources
- 4.7. Using Local Data in Superset
Creating a New Chart & Dashboard
- 5.1. Creating a New Chart
- 5.2. Discrete vs Continuous Data
- 5.3. Table Customization
- 5.4. Scatterplot Customization
- 5.5. Map Customization
Validating Data Sources via SQL Lab
Usage
- 7.1. Common Use Cases
  - Real-Time Event Stream Analysis
  - Geospatial Data Analysis
  - Time-Series Data Analytics
  - Sensor Data Collection and Environmental Monitoring
  - Aggregating and Enriching Data from Multiple Sources
  - Exporting Dashboards
  - Sharing Dashboards
- 7.2. Advanced Features
  - Real-Time Analytics with Pinot & Kafka Integration
  - Geospatial Customization with WMS Endpoints
  - Time-Bounded Querying with df-delta and Pinot
  - Automated Data Lifecycle Management with Kafka TTL
Best Practices
- 8.1. Optimizing Query Performance
- 8.2. Efficient Dashboard Design
- 8.3. Effective Data Source Management
Troubleshooting
- 9.1. No Data Displayed in Chart/Dashboard
- 9.2. Chart Takes Too Long to Load
- 9.3. Missing or Inaccurate Data in Dashboard
- 9.4. SQL Query Errors in SQL Lab
- 9.5. Map Not Rendering Correctly
- 9.6. Unexpected Formatting or Visualization Issues
- 9.7. Dashboards Not Displaying Updates
Reference Materials
Glossary
Appendices
- 12.1. Appendix A: Common SQL Queries
- 12.2. Appendix B: Role-Based Access Control (RBAC) Setup

Introduction

1.1. Purpose

This guide explains how to understand and use Superset’s capabilities within Raft’s Data Fabric and the SOF Data Layer.

1.2. Target Audience

Data Stewards, Data Analysts

1.3. Scope

This documentation gives users of the data platform detailed guidance on using Apache Superset for data exploration and visualization. It explains how to connect to various data sources, create and customize interactive dashboards, and visualize data with a wide range of chart types. It covers essential functionality such as building queries, filtering data, and setting up permissions for different user roles. It also offers best practices for optimizing performance, configuring Superset’s architecture, and integrating it with other platform components, so that users can analyze and present data insights in real time.

Overview

2.1. Capability Description

Superset provides data exploration and visualization capabilities, letting users analyze and interact with data through customizable dashboards and charts. It connects to a variety of databases and supports real-time querying and analysis without requiring code: users can build queries through a drag-and-drop interface, or write custom SQL for more advanced analytics.

The system’s integration with Superset facilitates collaboration by allowing users to share dashboards, insights, and reports within the platform. It supports a wide range of visualizations, from bar charts and pie charts to geospatial maps and time-series graphs, empowering users to make data-driven decisions. Additionally, Superset’s role-based access controls and data caching mechanisms ensure secure, efficient access to data, enhancing system performance and governance. This capability makes the system a robust platform for exploring, visualizing, and communicating data insights.

2.2. Key Features

Visualizations: numerous chart types, including bar charts, line charts, pie charts, heatmaps, time-series graphs, and geospatial maps, enabling rich and varied data visualizations.
No-Code Data Exploration: explore datasets and build visualizations without needing to write SQL.
SQL Query Support: users can write custom SQL queries for more complex data analysis and reporting, allowing flexibility in how data is queried.
Interactive Dashboards: enables users to create and share interactive, real-time dashboards that can be customized with filters, drill-down capabilities, and responsive layouts.
Multi-Database Connectivity: supports a wide variety of databases, providing flexibility in connecting to different data sources.
Role-Based Access Control (RBAC): granular user and role-based permissions, ensuring that data access is controlled and secure for different user groups.
Real-Time Data Exploration: perform real-time analysis on live data, making it ideal for monitoring, operational analytics, and business intelligence.
Collaborative Workflows: supports collaboration through the ability to share dashboards, visualizations, and insights across teams, enhancing decision-making and data democratization.

Available Data Sources

3.1. df-delta

df-delta is the main data store where all data is persisted after being ingested through Kafka. Delta is suitable for structured, analytics-friendly data and supports real-time analysis. It is the primary destination for most datasets in Data Fabric. Data remains here after the Kafka time-to-live (TTL) limit is reached.

3.2. df-kafka

Kafka is used as a message broker to stream data in real time before it is persisted in Delta. Data in Kafka is transient: it has a TTL of 1 hour, after which it is dropped from Kafka but remains permanently in Delta. Kafka is well suited to real-time event streaming and is not used for long-term storage.
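The retention rule above can be expressed as a small check. This is a minimal sketch, not part of Data Fabric: the helper names `still_in_kafka` and `sources_for` are hypothetical, and it assumes the 1-hour TTL is enforced exactly and that an event is already persisted in Delta by the time it is queried.

```python
from datetime import datetime, timedelta, timezone

# TTL stated in this guide for df-kafka; adjust if your deployment differs.
KAFKA_TTL = timedelta(hours=1)


def still_in_kafka(event_time, now, ttl=KAFKA_TTL):
    """Return True if an event ingested at event_time is still inside the TTL window."""
    return now - event_time < ttl


def sources_for(event_time, now):
    """List the data sources expected to hold the event (hypothetical helper)."""
    sources = ["df-delta"]  # Delta keeps the data permanently
    if still_in_kafka(event_time, now):
        sources.insert(0, "df-kafka")  # recent events are also still in Kafka
    return sources


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(sources_for(now - timedelta(minutes=20), now))  # ['df-kafka', 'df-delta']
print(sources_for(now - timedelta(hours=3), now))     # ['df-delta']
```

In practice this means a dataset built on df-kafka only ever shows roughly the last hour of data, while the same events remain queryable in df-delta.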

3.3. df-pinot

Pinot is a real-time analytics engine and OLAP store optimized for low-latency queries. It automatically pulls real-time data from Kafka streams for immediate analysis, making it well suited to dashboards or maps that require real-time views, such as monitoring events or detecting anomalies. Pinot handles fast, time-critical queries, while large-scale data storage is handled by Delta Lake.

3.4. df-postgres

PostgreSQL is used mainly for manual data uploads, allowing users to store relational data in a structured, SQL-compatible database. This source is suited for users who prefer uploading CSV files or working with custom schemas.

3.5. df-postgis

PostGIS is used for storing and querying spatial data. It supports case-by-case geospatial analyses and can be used when working with maps and spatial datasets.

3.6. df-arcadedb

ArcadeDB is a multi-model database that can handle graphs, documents, and other non-relational data models. It is used in specific cases where these data models are required, though it is not the default destination for most datasets.

Setup

4.1. Accessing Superset

There are two ways to navigate to Superset within Data Fabric:

- From the Data Fabric Home page, select the Superset card at the bottom left.

image::/assets/images/home.png[alt="Data Fabric Home"]

- You can also access Superset via the Data Insights tab directly, selecting Visualizations.

image::/assets/images/nav_data_insights.png[alt="Data Fabric Insights Tab"]

4.2. Logging In

Log in to Superset using your Keycloak credentials. Keycloak manages access to the application and ensures that your account has the appropriate access levels.

4.3. Navigation

Dashboards: View and manage your existing dashboards, or create a new one.
Charts: Create new visualizations from datasets.
Datasets: Manage and explore datasets.
SQL Lab: Run custom SQL queries.
Settings Menu: Manage users, security settings, and database connections.

image::/assets/images/superset.png[alt="Data Insights Tab"]

4.4. Additional Navigation Features

Favorites Section: Quickly access any dashboards or charts you’ve marked as favorites.
Sorting and Filtering Options: Sort and filter items by attributes (e.g., date created, modified).
Recents Section: Displays recently viewed, edited, or created items.
Import Button: Allows importing visualizations from JSON files.
Bulk Select Button: Select multiple items at once for bulk actions.

4.5. Adding Data Sources

To add or verify data sources:

- Navigate to the Datasets tab.
- Click on + Dataset to start the process.
- Select a Data Fabric-managed database, then choose the appropriate schema and table.

4.6. Connecting Disparate Data Sources

Superset allows users to combine data from different databases (e.g., df-postgres and df-kafka) in dashboards and queries for cross-analysis.

4.7. Using Local Data in Superset

Analysts can upload local datasets (e.g., CSVs) to databases like df-postgres, then connect them to Superset for querying and visualization.
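As a rough illustration of what such an upload does, the sketch below loads CSV rows into a database table using only the Python standard library. In-memory SQLite stands in for df-postgres so the example runs anywhere; the table name, column names, and the `load_csv` helper are hypothetical, and every value is stored as text for simplicity.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be read from a local file.
CSV_TEXT = """sensor_id,reading,recorded_at
a1,20.5,2024-01-01T00:00:00
a2,21.0,2024-01-01T00:05:00
"""


def load_csv(conn, table, csv_text):
    """Create `table` from the CSV header, insert every row, and return the row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for df-postgres
loaded = load_csv(conn, "sensor_readings", CSV_TEXT)
print(loaded)  # 2
```

Once the table exists in the target database, it can be registered as a dataset in Superset like any other table.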

Creating a New Chart & Dashboard

5.1. Creating a New Chart

After saving the dataset, choose a chart type that fits your data, such as a Bar Chart, Line Chart, or Pie Chart.

5.2. Discrete vs Continuous Data

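Discrete data represents distinct, separate values (for example, categories or counts), while continuous data can take any value within a range (for example, temperature or time). Choose your chart type accordingly: bar and pie charts suit discrete categories, while line charts suit continuous values such as measurements over time.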

5.3. Table Customization

Select columns to display under Dimensions, define aggregations under Metrics, and apply filters to restrict the rows shown.

5.4. Scatterplot Customization

Select numeric columns for the X and Y axes, and customize the points on the scatterplot by size, color, and filters.

5.5. Map Customization

Select latitude and longitude columns for geospatial data, and customize the map with WMS endpoints and layers.
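When adding a custom WMS endpoint, it can help to know what a map request to that endpoint looks like. The following is a minimal sketch that builds a GetMap URL using the standard WMS 1.3.0 query parameters. The host `wms.example.com`, the layer name `streets`, and the helper `wms_getmap_url` are hypothetical; how Superset itself issues WMS requests in this deployment is not covered here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wms_getmap_url(base_url, layer, bbox, width=512, height=512, crs="EPSG:4326"):
    """Build a WMS 1.3.0 GetMap URL (hypothetical helper).

    For EPSG:4326 in WMS 1.3.0 the bounding box is ordered
    (min_lat, min_lon, max_lat, max_lon).
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty string requests the default style
        "CRS": crs,
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"


url = wms_getmap_url("https://wms.example.com/wms", "streets", (38.8, -77.2, 39.0, -76.9))
print(url)

# Parse the query string back to confirm what the server would receive.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(query["BBOX"])
```

Opening a URL like this in a browser is one way to confirm that an endpoint and layer name are valid before configuring them on a map.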

Validating Data Sources via SQL Lab

Test data connectivity by running queries in SQL Lab, checking data availability, and verifying correct ingestion of datasets.
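The checks described above usually come down to a few short queries. The sketch below shows three such queries as strings that could be pasted into SQL Lab, and runs two of them against an in-memory SQLite database so the example is self-contained. The table `events` and column `event_time` are hypothetical, SQLite is only a stand-in for a Data Fabric database, and SQL dialects may differ slightly between sources.

```python
import sqlite3

# Queries one might paste into SQL Lab; `events` and `event_time` are hypothetical names.
ROW_COUNT_SQL = "SELECT COUNT(*) AS row_count FROM events"           # is any data present?
FRESHNESS_SQL = "SELECT MAX(event_time) AS latest_event FROM events"  # is ingestion current?
SAMPLE_SQL = "SELECT * FROM events LIMIT 10"                          # do the rows look right?

# In-memory SQLite stands in for a Data Fabric database so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, event_time TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-01T00:00:00"), (2, "2024-01-01T00:05:00")],
)

row_count = conn.execute(ROW_COUNT_SQL).fetchone()[0]
latest_event = conn.execute(FRESHNESS_SQL).fetchone()[0]
print(row_count, latest_event)  # 2 2024-01-01T00:05:00
```

A row count of zero or a stale latest timestamp points to an ingestion problem rather than a charting problem.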

Usage

7.1. Common Use Cases

Real-Time Event Stream Analysis: Monitor real-time data streams (e.g., security events, sensor data) using Kafka and Superset dashboards.
Geospatial Data Analysis: Visualize geospatial data for mission planning or situational awareness.
Time-Series Data Analytics: Analyze performance metrics and trends over time using time-series visualizations.
Sensor Data Collection and Environmental Monitoring: Use sensor data for environmental or operational monitoring.
Aggregating and Enriching Data from Multiple Sources: Combine data from multiple sources for cross-analysis in dashboards.
Exporting Dashboards: Export dashboards as JSON files to reuse in other Superset instances.
Sharing Dashboards: Share dashboards with links, adjusting permissions for view or edit access.

7.2. Advanced Features

Real-Time Analytics with Pinot & Kafka Integration: Use Pinot’s OLAP capabilities to query real-time data from Kafka streams for fast, low-latency analysis.
Geospatial Customization with WMS Endpoints: Customize maps with layers such as streets, parks, lights, and satellite views, or add custom WMS endpoints.
Time-Bounded Querying with df-delta and Pinot: Filter data by specific time ranges for optimized querying of time-series data.
Automated Data Lifecycle Management with Kafka TTL: Configure Kafka TTL settings to manage data lifecycle, with real-time streaming in Kafka and long-term storage in df-delta.

Best Practices

8.1. Optimizing Query Performance

Use time filters, aggregate data at query level, and set row limits to improve performance.
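The three techniques can be combined in a single query. The sketch below builds a query with a time filter, a database-side aggregation, and a row limit, then runs it against in-memory SQLite as a stand-in for a Data Fabric database. The helper `bounded_aggregate_sql`, the table `events`, and its columns are hypothetical, and the `?` placeholders are a Python database-driver convention; in SQL Lab you would write the time bounds as literals.

```python
import sqlite3


def bounded_aggregate_sql(table, time_col, group_col, row_limit=1000):
    """Build a query that filters by time, aggregates in the database, and caps rows.

    Identifiers are interpolated directly, so pass only trusted names.
    """
    return (
        f"SELECT {group_col}, COUNT(*) AS event_count "
        f"FROM {table} "
        f"WHERE {time_col} >= ? AND {time_col} < ? "  # time filter
        f"GROUP BY {group_col} "                      # aggregate at query level
        f"ORDER BY event_count DESC "
        f"LIMIT {int(row_limit)}"                     # row limit
    )


# In-memory SQLite stands in for a Data Fabric database (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (sensor_id TEXT, event_time TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("a", "2024-01-01T00:10:00"),
        ("a", "2024-01-01T00:20:00"),
        ("b", "2024-01-01T00:30:00"),
        ("b", "2024-01-01T02:00:00"),  # outside the one-hour window below
    ],
)

sql = bounded_aggregate_sql("events", "event_time", "sensor_id", row_limit=100)
rows = conn.execute(sql, ("2024-01-01T00:00:00", "2024-01-01T01:00:00")).fetchall()
print(rows)  # [('a', 2), ('b', 1)]
```

Aggregating in the query means the chart receives one row per group instead of every raw event, which is typically where most of the performance gain comes from.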

8.2. Efficient Dashboard Design

Limit visualizations per dashboard, and select appropriate chart types for your data.

8.3. Effective Data Source Management

Verify datasets in SQL Lab before building dashboards, and monitor data lifecycle with Kafka TTL.

Troubleshooting

9.1. No Data Displayed in Chart/Dashboard

Verify dataset connectivity and check SQL queries for errors.

9.2. Chart Takes Too Long to Load

Reduce data volume by applying filters or setting row limits.

9.3. Missing or Inaccurate Data in Dashboard

Check for data ingestion issues or schema changes.

9.4. SQL Query Errors in SQL Lab

Check SQL syntax and correct table/schema selection.

9.5. Map Not Rendering Correctly

Verify correct latitude/longitude columns and WMS endpoint configurations.
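A common cause of misplaced or missing points is a latitude column mapped to longitude, or the reverse. The following is a minimal sketch of a range check that flags invalid coordinates and likely swaps; the function names are hypothetical and the check is a heuristic, not a Superset feature.

```python
def valid_coordinate(lat, lon):
    """True when lat is within [-90, 90] and lon is within [-180, 180]."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def looks_swapped(lat, lon):
    """True when the pair is invalid as (lat, lon) but valid as (lon, lat).

    Swaps where both values fall within [-90, 90] cannot be detected this way.
    """
    return not valid_coordinate(lat, lon) and valid_coordinate(lon, lat)


print(valid_coordinate(37.8, -122.4))  # True
print(looks_swapped(-122.4, 37.8))     # True
print(valid_coordinate(95.0, 200.0))   # False
```

Running a check like this on a sample of rows can quickly show whether the columns selected for the map are the right ones.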

9.6. Unexpected Formatting or Visualization Issues

Choose appropriate chart types and reduce the number of metrics.

9.7. Dashboards Not Displaying Updates

Clear the cache and verify data ingestion.

Reference Materials

Apache Superset Documentation - The official Apache Superset documentation provides a comprehensive guide to Superset’s functionality, including chart types, SQL Lab, and how to manage dashboards.
Apache Pinot Documentation - If using the df-pinot database for OLAP queries, review the Apache Pinot documentation for configuring queries and optimizing performance.
PostGIS Guide - When working with geospatial data, the PostGIS documentation provides detailed guidance on spatial queries and optimizations.
NASA GIBS WMS - A guide to NASA’s Global Imagery Browse Services (GIBS) for geospatial WMS layers.
GeoServer WMS Documentation - Provides information on configuring and using WMS layers with GeoServer.

Glossary

Aggregation: The process of summarizing data, often by computing statistics like sum, average, or count, commonly applied to metrics in visualizations.

Apache Superset: An open-source data exploration and visualization platform used for building interactive dashboards and visualizing large datasets.

Dashboard: A collection of visualizations or charts organized into a single interface for monitoring and analyzing data in Superset.

Datasets: Collections of data available for querying and visualization within Superset, typically stored in databases like df-delta, df-pinot, or df-postgis.

df-arcadedb: A multi-model database capable of handling graph and document-based data. Typically used on a case-by-case basis within Data Fabric.

#### Through Data Catalog:

a. From the dataset view, click `OPEN IN SQL LAB`.
   ![](/images/screenshots/df-superset-dataset-view.png)

2. Select the appropriate database (`df-delta` is where your Delta data resides).
   ![](/images/screenshots/df-superset-database.png)

Execute queries against the database and visualize the results.
![](/images/screenshots/df-superset-query-execution.png)