Superset
POC: Raft
Table of Contents
- Introduction
  - 1.1. Purpose
  - 1.2. Target Audience
  - 1.3. Scope
- Overview
  - 2.1. Capability Description
  - 2.2. Key Features
- Available Data Sources
  - 3.1. df-delta
  - 3.2. df-kafka
  - 3.3. df-pinot
  - 3.4. df-postgres
  - 3.5. df-postgis
  - 3.6. df-arcadedb
- Setup
  - 4.1. Accessing Superset
  - 4.2. Logging In
  - 4.3. Navigation
    - Dashboards
    - Charts
    - Datasets
    - SQL Lab
    - Settings Menu
  - 4.4. Additional Navigation Features
    - Favorites Section
    - Sorting and Filtering Options
    - Recents Section
    - Import Button
    - Bulk Select Button
  - 4.5. Adding Data Sources
  - 4.6. Connecting Disparate Data Sources
  - 4.7. Using Local Data in Superset
- Creating a New Chart & Dashboard
  - 5.1. Creating a New Chart
  - 5.2. Discrete vs Continuous Data
  - 5.3. Table Customization
  - 5.4. Scatterplot Customization
  - 5.5. Map Customization
- Validating Data Sources via SQL Lab
- Usage
  - 7.1. Common Use Cases
    - Real-Time Event Stream Analysis
    - Geospatial Data Analysis
    - Time-Series Data Analytics
    - Sensor Data Collection
    - Aggregating Data from Multiple Sources
    - Exporting Dashboards
    - Sharing Dashboards
  - 7.2. Advanced Features
    - Real-Time Analytics with Pinot & Kafka
    - Geospatial Customization with WMS Endpoints
    - Time-Bounded Querying with df-delta and Pinot
    - Automated Data Lifecycle Management with Kafka TTL
- Best Practices
  - 8.1. Optimizing Query Performance
  - 8.2. Efficient Dashboard Design
  - 8.3. Effective Data Source Management
- Troubleshooting
  - 9.1. No Data Displayed in Chart/Dashboard
  - 9.2. Chart Takes Too Long to Load
  - 9.3. Missing or Inaccurate Data in Dashboard
  - 9.4. SQL Query Errors in SQL Lab
  - 9.5. Map Not Rendering Correctly
  - 9.6. Unexpected Formatting or Visualization Issues
  - 9.7. Dashboards Not Displaying Updates
- Reference Materials
- Glossary
- Appendices
  - 12.1. Appendix A: Common SQL Queries
  - 12.2. Appendix B: Role-Based Access Control (RBAC) Setup
Introduction
1.1. Purpose
This document is a comprehensive guide to understanding and using Superset’s capabilities within Raft’s Data Fabric and the SOF Data Layer.
1.3. Scope
This documentation gives Data Fabric users detailed guidance on using Apache Superset for data exploration and visualization. It explains how to connect to data sources, create and customize interactive dashboards, and visualize data with a wide range of chart types. It covers essential functionality such as building queries, filtering data, and setting up permissions for different user roles. It also offers best practices for optimizing performance, configuring Superset’s architecture, and integrating it with other platform components, so that users can analyze and present data insights in real time.
Overview
2.1. Capability Description
Superset gives the platform data exploration and visualization capabilities, letting users analyze and interact with data through customizable dashboards and charts. It connects to a variety of databases and supports real-time querying and analysis without writing code. Users can build queries with a drag-and-drop interface or write custom SQL for more advanced analytics.
The system’s integration with Superset facilitates collaboration by allowing users to share dashboards, insights, and reports within the platform. It supports a wide range of visualizations, from bar charts and pie charts to geospatial maps and time-series graphs, empowering users to make data-driven decisions. Additionally, Superset’s role-based access controls and data caching mechanisms ensure secure, efficient access to data, enhancing system performance and governance. This capability makes the system a robust platform for exploring, visualizing, and communicating data insights.
2.2. Key Features
- Visualizations: numerous chart types, including bar charts, line charts, pie charts, heatmaps, time-series graphs, and geospatial maps, enabling rich and varied data visualizations.
- No-Code Data Exploration: explore datasets and build visualizations without needing to write SQL.
- SQL Query Support: users can write custom SQL queries for more complex data analysis and reporting, allowing flexibility in how data is queried.
- Interactive Dashboards: enables users to create and share interactive, real-time dashboards that can be customized with filters, drill-down capabilities, and responsive layouts.
- Multi-Database Connectivity: supports a wide variety of databases, providing flexibility in connecting to different data sources.
- Role-Based Access Control (RBAC): granular user and role-based permissions, ensuring that data access is controlled and secure for different user groups.
- Real-Time Data Exploration: perform real-time analysis on live data, making it ideal for monitoring, operational analytics, and business intelligence.
- Collaborative Workflows: supports collaboration through the ability to share dashboards, visualizations, and insights across teams, enhancing decision-making and data democratization.
Available Data Sources
3.1. df-delta
df-delta is the main data store where all data is persisted after being ingested through Kafka. Delta is suitable for structured, analytics-friendly data and supports real-time analysis. It is the primary destination for most datasets in Data Fabric. Data remains here after the Kafka time-to-live (TTL) limit is reached.
3.2. df-kafka
Kafka is the message broker that streams data in real time before it is persisted in Delta. Data in Kafka is transient: it has a TTL of 1 hour, after which it is dropped from Kafka but remains permanently in Delta. Kafka is optimal for real-time event streaming and is not used for long-term storage.
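The retention behavior described above can be sketched as a small rule that reports which sources should still hold a record of a given age. This is an illustration only: the helper name `sources_holding` and the constant `KAFKA_TTL` are hypothetical, the 1-hour value is taken from this guide, and any ingestion lag between Kafka and Delta is ignored.

```python
from datetime import datetime, timedelta, timezone

# TTL stated in this guide; the constant name is illustrative.
KAFKA_TTL = timedelta(hours=1)


def sources_holding(event_time: datetime, now: datetime) -> list[str]:
    """Return the Data Fabric sources expected to still hold a record.

    Hypothetical helper: a record is assumed to be persisted in df-delta,
    and to remain readable from df-kafka until the TTL has elapsed.
    """
    age = now - event_time
    if age < timedelta(0):
        raise ValueError("event_time is in the future")
    sources = ["df-delta"]
    if age < KAFKA_TTL:
        sources.append("df-kafka")
    return sources


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(sources_holding(now - timedelta(minutes=10), now))  # ['df-delta', 'df-kafka']
print(sources_holding(now - timedelta(hours=3), now))     # ['df-delta']
```

In practice this means a dashboard that needs history older than one hour should be built on df-delta rather than on the Kafka stream.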
3.3. df-pinot
Pinot is a real-time analytics engine and OLAP store optimized for low-latency queries. It automatically pulls real-time data from Kafka streams for immediate complex analysis, making it ideal for dashboards or maps requiring real-time views, like monitoring events or detecting anomalies. While Pinot excels in fast, time-critical queries, large-scale data storage is handled by Delta Lake. Pinot is best used for scenarios requiring quick, real-time analysis.
3.4. df-postgres
PostgreSQL is used mainly for manual data uploads, allowing users to store relational data in a structured, SQL-compatible database. This source is suited for users who prefer uploading CSV files or working with custom schemas.
Setup
4.1. Accessing Superset
There are two ways to navigate to Superset within Data Fabric:
- From the Data Fabric Home page, select the Superset card at the bottom left.
image::/assets/images/home.png[alt="Data Fabric Home"]
- You can also access Superset directly via the Data Insights tab by selecting Visualizations.
image::/assets/images/nav_data_insights.png[alt="Data Fabric Insights Tab"]
4.2. Logging In
Log in to Superset using your Keycloak credentials. Keycloak manages access to the application and ensures that your account has the appropriate access levels.
4.3. Navigation
- Dashboards: View and manage your existing dashboards, or create a new one.
- Charts: Create new visualizations from datasets.
- Datasets: Manage and explore datasets.
- SQL Lab: Run custom SQL queries.
- Settings Menu: Manage users, security settings, and database connections.
image::/assets/images/superset.png[alt="Data Insights Tab"]
4.4. Additional Navigation Features
- Favorites Section: Quickly access any dashboards or charts you’ve marked as favorites.
- Sorting and Filtering Options: Sort and filter items by attributes (e.g., date created, modified).
- Recents Section: Displays recently viewed, edited, or created items.
- Import Button: Allows importing visualizations from JSON files.
- Bulk Select Button: Select multiple items at once for bulk actions.
4.5. Adding Data Sources
To add or verify data sources:
- Navigate to the Datasets tab.
- Click on + Dataset to start the process.
- Select a Data Fabric-managed database, then choose the appropriate schema and table.
Creating a New Chart & Dashboard
5.1. Creating a New Chart
After saving the dataset, choose a chart type that fits your data, such as a Bar Chart, Line Chart, or Pie Chart.
5.2. Discrete vs Continuous Data
Discrete data represents distinct, separate values, while continuous data can take any value within a range. Choose your chart type accordingly.
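The distinction can be made concrete with a rough heuristic. The sketch below is illustrative only: the function `classify_column`, the `max_distinct` threshold, and the chart pairings in `SUGGESTED_CHARTS` are assumptions made for this example, not rules that Superset applies.

```python
from numbers import Real


def classify_column(values, max_distinct=20):
    """Heuristically label a column as 'discrete' or 'continuous'.

    Assumed rule of thumb: non-numeric values are discrete; numeric values
    are discrete only if they are all whole numbers with few distinct values.
    """
    non_null = [v for v in values if v is not None]
    numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in non_null
    )
    if not numeric:
        return "discrete"
    all_whole = all(float(v).is_integer() for v in non_null)
    if all_whole and len(set(non_null)) <= max_distinct:
        return "discrete"
    return "continuous"


# Illustrative pairings only; pick whatever chart suits the question asked.
SUGGESTED_CHARTS = {
    "discrete": ["Bar Chart", "Pie Chart"],
    "continuous": ["Line Chart", "Scatterplot"],
}

print(classify_column(["red", "blue", "red"]))  # discrete
print(classify_column([20.5, 21.1, 19.8]))      # continuous
```

A category or status column would fall on the discrete side and suit a bar or pie chart, while a temperature or latency column would fall on the continuous side.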
5.3. Table Customization
Select columns to display under Dimensions, and apply filters or aggregations under Metrics.
Validating Data Sources via SQL Lab
Test data connectivity by running queries in SQL Lab, checking data availability, and verifying correct ingestion of datasets.
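These checks can be sketched as three short queries: a row count, a null check on a key column, and the most recent timestamp. The example below uses an in-memory SQLite database as a stand-in so that it runs anywhere; the table `sensor_readings` and its columns are hypothetical, and SQL dialect details may differ on df-delta, df-pinot, or df-postgres. In SQL Lab you would run only the SQL statements against the selected database.

```python
import sqlite3

# In-memory SQLite stands in for a Data Fabric database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_readings (sensor_id TEXT, reading REAL, event_time TEXT)"
)
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?, ?)",
    [
        ("a1", 20.5, "2024-01-01T10:00:00"),
        ("a2", None, "2024-01-01T10:05:00"),
        ("a1", 21.0, "2024-01-01T10:10:00"),
    ],
)

# Each query answers one validation question about the dataset.
CHECKS = {
    "row_count": "SELECT COUNT(*) FROM sensor_readings",
    "null_readings": "SELECT COUNT(*) FROM sensor_readings WHERE reading IS NULL",
    "latest_event": "SELECT MAX(event_time) FROM sensor_readings",
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
print(results)
```

A row count of zero points to a connectivity or ingestion problem, unexpected nulls point to a mapping problem, and a stale latest timestamp suggests ingestion has stopped.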
Usage
7.1. Common Use Cases
- Real-Time Event Stream Analysis: Monitor real-time data streams (e.g., security events, sensor data) using Kafka and Superset dashboards.
- Geospatial Data Analysis: Visualize geospatial data for mission planning or situational awareness.
- Time-Series Data Analytics: Analyze performance metrics and trends over time using time-series visualizations.
- Sensor Data Collection and Environmental Monitoring: Use sensor data for environmental or operational monitoring.
- Aggregating and Enriching Data from Multiple Sources: Combine data from multiple sources for cross-analysis in dashboards.
- Exporting Dashboards: Export dashboards as JSON files to reuse in other Superset instances.
- Sharing Dashboards: Share dashboards with links, adjusting permissions for view or edit access.
7.2. Advanced Features
- Real-Time Analytics with Pinot & Kafka Integration: Use Pinot’s OLAP capabilities to query real-time data from Kafka streams for fast, low-latency analysis.
- Geospatial Customization with WMS Endpoints: Customize maps with layers such as streets, parks, lights, and satellite views, or add custom WMS endpoints.
- Time-Bounded Querying with df-delta and Pinot: Filter data by specific time ranges for optimized querying of time-series data.
- Automated Data Lifecycle Management with Kafka TTL: Configure Kafka TTL settings to manage data lifecycle, with real-time streaming in Kafka and long-term storage in df-delta.
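Time-bounded querying from the list above can be sketched as a query restricted to a half-open window `[start, end)`. This is a minimal illustration, not the platform's method: the helper `time_bounded_query`, the `events` table, and its columns are hypothetical, and SQLite stands in for df-delta or df-pinot. In SQL Lab the bounds would typically be written as literal timestamps in the `WHERE` clause.

```python
import sqlite3
from datetime import datetime, timedelta


def time_bounded_query(table: str, time_column: str) -> str:
    """Build a query restricted to the half-open window [start, end)."""
    # Identifiers cannot be bound as parameters, so validate them first.
    for name in (table, time_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"SELECT * FROM {table} "
        f"WHERE {time_column} >= ? AND {time_column} < ? "
        f"ORDER BY {time_column}"
    )


# Stand-in data (assumption): ISO 8601 strings sort chronologically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, event_time TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2024-01-01T09:30:00"),
        (2, "2024-01-01T10:15:00"),
        (3, "2024-01-01T11:45:00"),
    ],
)

end = datetime(2024, 1, 1, 11, 0)
start = end - timedelta(hours=1)
rows = conn.execute(
    time_bounded_query("events", "event_time"),
    (start.isoformat(), end.isoformat()),
).fetchall()
print(rows)  # [(2, '2024-01-01T10:15:00')]
```

The half-open window avoids counting a record twice when consecutive windows are queried back to back.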
Best Practices
Troubleshooting
9.1. No Data Displayed in Chart/Dashboard
Verify dataset connectivity and check SQL queries for errors.
9.5. Map Not Rendering Correctly
Verify correct latitude/longitude columns and WMS endpoint configurations.
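A frequent cause is swapped or out-of-range coordinate columns. The check below is a small sketch: the function `invalid_coordinates` and the column names `latitude` and `longitude` are assumptions for illustration, and the valid ranges used are the standard -90 to 90 for latitude and -180 to 180 for longitude.

```python
def invalid_coordinates(rows, lat_key="latitude", lon_key="longitude"):
    """Return the rows whose coordinates are missing or out of range."""
    bad = []
    for row in rows:
        lat, lon = row.get(lat_key), row.get(lon_key)
        ok = (
            isinstance(lat, (int, float))
            and isinstance(lon, (int, float))
            and -90 <= lat <= 90
            and -180 <= lon <= 180
        )
        if not ok:
            bad.append(row)
    return bad


sample = [
    {"latitude": 38.9, "longitude": -77.0},   # valid
    {"latitude": -122.4, "longitude": 37.8},  # likely swapped columns
    {"latitude": None, "longitude": 10.0},    # missing latitude
]
for row in invalid_coordinates(sample):
    print("check row:", row)
```

If many rows fail with a latitude beyond 90 or below -90, the latitude and longitude columns are probably assigned the wrong way round in the chart configuration.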
Reference Materials
- Apache Superset Documentation - The official Apache Superset documentation provides a comprehensive guide to Superset’s functionality, including chart types, SQL Lab, and how to manage dashboards.
- Apache Pinot Documentation - If using the df-pinot database for OLAP queries, review the Apache Pinot documentation for configuring queries and optimizing performance.
- PostGIS Guide - When working with geospatial data, the PostGIS documentation provides detailed guidance on spatial queries and optimizations.
- NASA GIBS WMS - A guide to NASA’s Global Imagery Browse Services (GIBS) for geospatial WMS layers.
- GeoServer WMS Documentation - Provides information on configuring and using WMS layers with GeoServer.
Glossary
Aggregation: The process of summarizing data, often by computing statistics like sum, average, or count, commonly applied to metrics in visualizations.
Apache Superset: An open-source data exploration and visualization platform used for building interactive dashboards and visualizing large datasets.
Dashboard: A collection of visualizations or charts organized into a single interface for monitoring and analyzing data in Superset.
Datasets: Collections of data available for querying and visualization within Superset, typically stored in databases like df-delta, df-pinot, or df-postgis.
df-arcadedb: A multi-model database capable of handling graph and document-based data. Typically used on a case-by-case basis within Data Fabric.
#### Through Data Catalog:
1. From the dataset view, click `OPEN IN SQL LAB`.
![](/images/screenshots/df-superset-dataset-view.png)
2. Select the appropriate database (`df-delta` is where your Delta data is stored).
![](/images/screenshots/df-superset-database.png)
3. Execute queries against the database and visualize the results.
![](/images/screenshots/df-superset-query-execution.png)