
Tag: Spark

Announcing Photon Engine General Availability on the Databricks Lakehouse Platform
We are pleased to announce that Photon, the record-setting next-generation query engine for lakehouse systems, is now generally available on Databricks across all…
Introducing Spark Connect – The Power of Apache Spark, Everywhere
At last week’s Data and AI Summit, we highlighted a new project called Spark Connect in the opening keynote. This blog post walks…
Databricks Announces Major Contributions to Flagship Open Source Projects
Databricks announced that the company will contribute all features and enhancements it has made to Delta Lake to the Linux Foundation and open source all Delta Lake APIs as part of the Delta Lake 2.0 release. In addition, the company announced MLflow 2.0, which includes MLflow Pipelines, a new feature …
How to Monitor Streaming Queries in PySpark
Streaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and…
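A quick taste of what the post covers: every active Structured Streaming query in PySpark exposes built-in progress metrics you can poll directly. A minimal sketch (the rate source and console sink are placeholders, not part of the original post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start a trivial streaming query against the built-in rate source
query = (spark.readStream.format("rate").load()
         .writeStream.format("console")
         .start())

# Poll the monitoring hooks every StreamingQuery exposes
print(query.status)          # is the query actively processing?
print(query.lastProgress)    # metrics for the most recent micro-batch
print(query.recentProgress)  # a short history of progress updates

query.stop()
```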
Disaster Recovery Overview, Strategies, and Assessment
When deciding on a Disaster Recovery (DR) strategy that serves the entire firm for most applications and systems, an assessment of priorities, capabilities,…
Databricks’ Open Source Genomics Toolkit Outperforms Leading Tools
Genomic technologies are driving the creation of new therapeutics, from RNA vaccines to gene editing and diagnostics. Progress in these areas motivated us to build Glow, an open-source toolkit for genomics machine learning and data analytics. The toolkit is natively built on Apache Spark™, the leading engine for big data …
Introducing SQL User-Defined Functions
A user-defined function (UDF) is a means for a user to extend the native capabilities of Apache Spark™ SQL. Spark SQL has supported external user-defined functions written in the Scala, Java, Python and R programming languages since version 1.3.0. While external UDFs are very powerful, they also come with a few caveats: …
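To make the contrast concrete, here is a hedged sketch of both flavors: an external Python UDF, which Spark's optimizer treats as a black box, and a SQL UDF defined with CREATE FUNCTION ... RETURN, assuming a runtime that supports that syntax. The temperature-conversion function is a hypothetical example, not one from the post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# External Python UDF: powerful, but opaque to Spark's optimizer
@udf(returnType=DoubleType())
def to_fahrenheit_py(c):
    return c * 9.0 / 5.0 + 32.0

# SQL UDF: defined entirely in SQL, so Spark can inline and optimize it
# (assumes a runtime that supports the SQL UDF syntax)
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION to_fahrenheit(c DOUBLE)
    RETURNS DOUBLE RETURN c * 9 / 5 + 32
""")

df = spark.range(1).select(col("id").cast("double").alias("c"))
df.select(to_fahrenheit_py("c")).show()
spark.sql("SELECT to_fahrenheit(100.0)").show()
```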
Introducing Apache Spark™ 3.2
We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to thank the Apache Spark community for their valuable contributions to the Spark 3.2 release. The number of monthly Maven downloads of Spark has rapidly increased to 20 million. …
How Incremental ETL Makes Life Simpler With Data Lakes
Incremental ETL (Extract, Transform and Load) in a conventional data warehouse has become commonplace with CDC (change data capture) sources, but scale, cost, accounting for state and the lack of machine learning access make it less than ideal. In contrast, incremental ETL in a data lake hasn’t been possible due …
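As a hedged illustration of what incremental ETL on a data lake can look like, a Delta Lake MERGE statement upserts a batch of change records into a target table; the table names and join key below are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with Delta Lake support
spark = SparkSession.builder.getOrCreate()

# Upsert a batch of CDC-style change records into a Delta table;
# `target`, `updates`, and the `id` key are hypothetical names
spark.sql("""
    MERGE INTO target t
    USING updates u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```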
How We Built Databricks on Google Kubernetes Engine (GKE)
Our release of Databricks on Google Cloud Platform (GCP) was a major milestone toward a unified data, analytics and AI platform that is truly multi-cloud. Databricks on GCP, a jointly-developed service that allows you to store all of your data on a simple, open lakehouse platform, is based on standard …
An Experimentation Pipeline for Extracting Topics From Text Data Using PySpark
This post is part of a series of posts on topic modeling. Topic modeling is the process of extracting topics from a set of text documents. This is useful for understanding or summarizing large collections of text documents. A document can be a line of text, a paragraph or a …
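For readers who want to try the idea before digging into the series, a minimal topic-modeling pipeline in PySpark might look like the sketch below; the toy documents and parameter values are placeholders, not the post's actual pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()

# Toy corpus: each row is one "document"
docs = spark.createDataFrame(
    [("spark streaming query engine performance",),
     ("gene editing rna vaccines diagnostics",)],
    ["text"])

# Tokenize, then build a term-frequency vector per document
tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
cv_model = CountVectorizer(inputCol="words", outputCol="features",
                           vocabSize=1000).fit(tokens)
vectorized = cv_model.transform(tokens)

# Fit LDA and inspect the top terms per extracted topic
model = LDA(k=2, maxIter=10).fit(vectorized)
model.describeTopics(3).show(truncate=False)
```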
The Delta Between ML Today and Efficient ML Tomorrow
Delta Lake and MLflow both come up frequently in conversation but often as two entirely separate products. This blog will focus on the synergies between Delta Lake and MLflow for machine learning use cases and explain how you can leverage Delta Lake to deliver strong ML results based on solid …
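One such synergy can be sketched in a few lines: Delta Lake's time travel pins the exact snapshot of the training data, and MLflow records which snapshot a run used. The path and version number below are hypothetical:

```python
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pin the exact snapshot of the training data with Delta time travel
version = 5  # hypothetical table version
train_df = (spark.read.format("delta")
            .option("versionAsOf", version)
            .load("/data/events"))  # hypothetical path

# Record the snapshot alongside the run so the experiment is reproducible
with mlflow.start_run():
    mlflow.log_param("delta_version", version)
    mlflow.log_param("training_rows", train_df.count())
```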
AML Solutions at Scale Using Databricks Lakehouse Platform
Anti-Money Laundering (AML) compliance has undoubtedly been one of the top agenda items for regulators providing oversight of financial institutions across the globe. As money laundering has evolved and become more sophisticated over the decades, so have the regulatory requirements designed to counter modern money laundering and terrorist financing schemes. The Bank …
Get Your Free Copy of Delta Lake: The Definitive Guide (Early Release)
At the Data + AI Summit, we were thrilled to announce the early release of Delta Lake: The Definitive Guide, published by O’Reilly. The guide teaches how to build a modern lakehouse architecture that combines the performance, reliability and data integrity of a warehouse with the flexibility, scale and support …
Don’t Miss These Top 10 Announcements From Data + AI Summit
The 2021 Data + AI Summit was filled with so many exciting announcements for open source and Databricks, talks from top-tier creators across the industry (such as Rajat Monga, co-creator of TensorFlow) and guest luminaries like Bill Nye, Malala Yousafzai and the NASA Mars Rover team. You can watch the …
What’s New in Apache Spark™ 3.1 Release for Structured Streaming
Structured Streaming, which provides stream processing on top of Spark Core and the SQL API, is one of the most important components of Apache Spark™. In this blog post, we summarize the notable improvements to Structured Streaming in the latest 3.1 release, including a new streaming table API, …
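The new streaming table API can be sketched as follows, assuming Spark 3.1 or later; the table names and checkpoint path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read an existing table as a stream (new in Spark 3.1)
events = spark.readStream.table("raw_events")  # hypothetical source table

# Write the stream back out to another table (also new in Spark 3.1)
query = (events.writeStream
         .option("checkpointLocation", "/tmp/ckpt")  # placeholder path
         .toTable("clean_events"))                   # hypothetical target table
```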
A Guide to Data + AI Summit Sessions: Machine Learning, Data Engineering, Apache Spark and More
We are only a few weeks away from Data + AI Summit, returning May 24–28. If you haven’t signed up yet, take advantage of free registration for five days of virtual engagement: training, talks, meetups, AMAs and community camaraderie. To help you navigate through hundreds of sessions, I am sharing …
How (Not) to Tune Your Model with Hyperopt
Hyperopt is a powerful tool for tuning ML models with Apache Spark. Read on to learn how to define and execute (and debug) the tuning optimally! So, you want to build a model. You’ve solved the harder problems of accessing data, cleaning it and selecting features. Now, you just need …
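As a preview of the workflow the post walks through, the core Hyperopt loop is a search space, an objective function and a call to fmin; on Spark, Hyperopt's SparkTrials can replace Trials to distribute the evaluations. A minimal sketch with a toy objective:

```python
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    # In practice: train a model with these hyperparameters and return its loss
    return (params["x"] - 3) ** 2

trials = Trials()  # swap in hyperopt.SparkTrials() to parallelize on Spark
best = fmin(fn=objective,
            space={"x": hp.uniform("x", -10, 10)},
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)
print(best)  # best hyperparameters found, e.g. {'x': 2.97...}
```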