Databricks and Apache Spark
Databricks, Inc. is a global data, analytics, and artificial intelligence (AI) company founded by the original creators of Apache Spark.
The company provides a cloud-based platform to help enterprises build, scale, and govern data and AI, including generative AI and other machine learning models.
Databricks pioneered the data lakehouse, a data and AI platform that combines the capabilities of a data warehouse with a data lake, allowing organizations to manage and use both structured and unstructured data for traditional business analytics and AI workloads.
In November 2023, Databricks unveiled the Databricks Data Intelligence Platform, a new offering that combines the unification benefits of the lakehouse with MosaicML’s Generative AI technology to enable customers to better understand and use their own proprietary data.
The company develops Delta Lake, an open-source project to bring reliability to data lakes for machine learning and other data science use cases.
AN OPEN AND UNIFIED DATA ANALYTICS PLATFORM FOR DATA ENGINEERING, MACHINE LEARNING, AND ANALYTICS
From the original creators of Apache Spark, Delta Lake, MLflow, and Koalas
Select a platform:
DATABRICKS PLATFORM - FREE TRIAL
For businesses
CHOOSE YOUR CLOUD
For students and educational institutions
https://databricks.com/try-platform
Welcome to Databricks Community Edition!
Databricks Community Edition provides you with access to a free Spark micro-cluster, a cluster manager, and a notebook environment, ideal for developers, data scientists, data engineers, and other IT professionals getting started with Spark.
We need you to verify your email address by clicking on this link. You will then be redirected to Databricks Community Edition!
Get started by visiting: https://community.cloud.databricks.com/login.html
If you have any questions, please contact feedback@databricks.com.
- The Databricks Team
Instructor: Adam Breindel
LinkedIn: https://www.linkedin.com/in/adbreind
Email: adbreind@gmail.com
Twitter: @adbreind
- 20+ years building systems for startups and large enterprises
- 10+ years teaching data, ML, front- and back-end technology
- Fun large-scale data projects…
  - Streaming neural net + decision tree fraud scoring
  - Realtime & offline analytics for banking
  - Music synchronization and licensing for networked jukeboxes
- Industries
  - Finance, Insurance
  - Travel, Media / Entertainment
  - Energy, Government
Create a Databricks account
• Sign up for free Community Edition now at https://databricks.com/try-databricks
• Use Firefox, Chrome, or Safari
Getting Started
These steps are illustrated on subsequent pages; this is the summary:
1. Copy the courseware link or prepare to type it ☺
   https://materials.s3.amazonaws.com/2021/oreilly0708/spark.dbc
2. Import that file into your Databricks account per the instructions on the following slides.
3. Create a cluster: choose Databricks Runtime 8.2 (illustrated in the following slides)

Setup with Databricks

Log in to Databricks

Import Notebooks for Today…
- Choose URL, then type or paste in today’s notebook URL and click Import:
  https://materials.s3.amazonaws.com/2021/oreilly0708/spark.dbc
- Click Workspace to find your notebook(s)

Create a Cluster
- Choose Runtime 8.2 (Scala 2.12, Spark 3.1.1)

All set: let's go!
https://learning.oreilly.com/live-events/spark-31-first-steps/0636920371533/0636920061945/
https://on24static.akamaized.net/event/33/46/13/4/rt/1/documents/resourceList1631717069306/setup.pdf
Create Cluster
New Cluster
- 0 Workers: 0 GB Memory, 0 Cores, 0 DBU
- 1 Driver: 15.3 GB Memory, 2 Cores, 1 DBU
- Cluster Name: Buddha
- Databricks Runtime Version: Runtime 8.3 (Scala 2.12, Spark 3.1.1)
  Note: Databricks Runtime 8.x and later use Delta Lake as the default table format.
- Instance: Free 15 GB Memory
As a Community Edition user, your cluster will automatically terminate after an idle period of two hours. For more configuration options, please upgrade your Databricks subscription.
Import Notebooks
Import from: File or URL
https://materials.s3.amazonaws.com/2021/oreilly0708/spark.dbc
Accepted formats: .dbc, .scala, .py, .sql, .r, .ipynb, .Rmd, .html, .zip
This notebook is not attached to a cluster. Would you like to launch a new Spark cluster to continue working?
Automatically launch and attach to clusters without prompting
First Steps with Apache Spark 3.1
Class Logistics and Operations
Topics
ls /databricks-datasets/amazon/
Output columns: path, name, size
Where is this Spark data source? Currently, Amazon S3
In this version of Databricks, “DBFS” (a wrapper over S3 similar to EMRFS) is the default Spark filesystem.
Other common defaults include HDFS, “local”, another cloud-based object store such as Azure Blob Storage, or a Kubernetes-friendly storage layer like MinIO.
%sql SELECT * FROM parquet.`/databricks-datasets/amazon/data20K`
df: org.apache.spark.sql.DataFrame = [rating: double, review: string]