
Tag: Spark

Introducing SQL User-Defined Functions
A user-defined function (UDF) is a means for a user to extend the native capabilities of Apache Spark™ SQL. Spark SQL has supported external user-defined functions written in Scala, Java, Python and R since version 1.3.0. While external UDFs are very powerful, they also come with a few caveats: …
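In Spark the registration call is `spark.udf.register` (or `pandas_udf` for vectorized UDFs). As a conceptual sketch in plain Python — the `registry` dict, the row list and the temperature function below are illustrative assumptions, not Spark APIs — a UDF is just a named function the SQL engine applies per row:

```python
# Conceptual sketch only: a UDF is a named function the SQL engine can call per row.
# In real Spark you would write: spark.udf.register("celsius_to_f", celsius_to_f, "double")
# The `registry` dict and the sample rows are illustrative stand-ins, not Spark APIs.

def celsius_to_f(c: float) -> float:
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0

# A toy "function registry", playing the role of Spark's UDF catalog.
registry = {"celsius_to_f": celsius_to_f}

rows = [{"city": "Oslo", "temp_c": 10.0}, {"city": "Lima", "temp_c": 25.0}]

# Roughly equivalent to: SELECT city, celsius_to_f(temp_c) AS temp_f FROM rows
result = [
    {"city": r["city"], "temp_f": registry["celsius_to_f"](r["temp_c"])}
    for r in rows
]
# result[0] == {"city": "Oslo", "temp_f": 50.0}
```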
Introducing Apache Spark™ 3.2
We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to thank the Apache Spark community for their valuable contributions to the Spark 3.2 release. The number of monthly Maven downloads of Spark has rapidly increased to 20 million. …
How Incremental ETL Makes Life Simpler With Data Lakes
Incremental ETL (Extract, Transform and Load) in a conventional data warehouse has become commonplace with CDC (change data capture) sources, but scale, cost, accounting for state and the lack of machine learning access make it less than ideal. In contrast, incremental ETL in a data lake hasn’t been possible due …
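In a lakehouse, incremental ETL over CDC records typically maps onto a Delta Lake `MERGE INTO`. As a dependency-free sketch of the state handling involved — the change-record schema (an `op` of insert/update/delete plus an `id` key) is an assumption for illustration — an incremental batch is just an upsert/delete pass over a keyed table:

```python
# Minimal sketch of incremental ETL with CDC records. Illustrative schema: each
# change carries an "op" of insert/update/delete and an "id" key. In a lakehouse
# this corresponds to a Delta Lake MERGE; plain Python shows the state handling.

def apply_cdc(target: dict, changes: list) -> dict:
    """Apply one batch of CDC changes to a keyed target table (id -> row)."""
    for change in changes:
        key = change["id"]
        if change["op"] == "delete":
            target.pop(key, None)           # tolerate deletes for unseen keys
        else:                               # insert and update both upsert
            target[key] = {k: v for k, v in change.items() if k != "op"}
    return target

table = {1: {"id": 1, "amount": 100}}
batch = [
    {"op": "update", "id": 1, "amount": 150},
    {"op": "insert", "id": 2, "amount": 75},
    {"op": "delete", "id": 1},
]
table = apply_cdc(table, batch)
# table == {2: {"id": 2, "amount": 75}}
```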
How We Built Databricks on Google Kubernetes Engine (GKE)
Our release of Databricks on Google Cloud Platform (GCP) was a major milestone toward a unified data, analytics and AI platform that is truly multi-cloud. Databricks on GCP, a jointly-developed service that allows you to store all of your data on a simple, open lakehouse platform, is based on standard …
An Experimentation Pipeline for Extracting Topics From Text Data Using PySpark
This post is part of a series of posts on topic modeling. Topic modeling is the process of extracting topics from a set of text documents. This is useful for understanding or summarizing large collections of text documents. A document can be a line of text, a paragraph or a …
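The pipeline in the post uses PySpark (topic models like MLlib's LDA consume bag-of-words vectors). As a self-contained illustration of that first step — turning raw documents into term counts — here is a stdlib-only sketch; the tiny stopword list and sample documents are made-up assumptions:

```python
# Illustrative first step of topic modeling: turn raw documents into term counts,
# the bag-of-words input an algorithm such as LDA consumes. The tiny stopword
# list and the sample documents are made-up assumptions for this sketch.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "to", "in"}

def term_counts(doc: str) -> Counter:
    """Lowercase, tokenize on letter runs, and count non-stopword terms."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

docs = [
    "The engine of the spark cluster processes data in parallel.",
    "Topic modeling extracts topics from a set of text documents.",
]
counts = [term_counts(d) for d in docs]
# counts[0]["spark"] == 1; stopwords like "the" are dropped
```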
The Delta Between ML Today and Efficient ML Tomorrow
Delta Lake and MLflow both come up frequently in conversation but often as two entirely separate products. This blog will focus on the synergies between Delta Lake and MLflow for machine learning use cases and explain how you can leverage Delta Lake to deliver strong ML results based on solid …
AML Solutions at Scale Using Databricks Lakehouse Platform
Anti-Money Laundering (AML) compliance has undoubtedly been one of the top agenda items for regulators providing oversight of financial institutions across the globe. As money laundering has evolved and become more sophisticated over the decades, so have the regulatory requirements designed to counter modern money laundering and terrorist financing schemes. The Bank …
Get Your Free Copy of Delta Lake: The Definitive Guide (Early Release)
At the Data + AI Summit, we were thrilled to announce the early release of Delta Lake: The Definitive Guide, published by O’Reilly. The guide teaches how to build a modern lakehouse architecture that combines the performance, reliability and data integrity of a warehouse with the flexibility, scale and support …
Don’t Miss These Top 10 Announcements From Data + AI Summit
The 2021 Data + AI Summit was filled with so many exciting announcements for open source and Databricks, talks from top-tier creators across the industry (such as Rajat Monga, co-creator of TensorFlow) and guest luminaries like Bill Nye, Malala Yousafzai and the NASA Mars Rover team. You can watch the …
What’s New in Apache Spark™ 3.1 Release for Structured Streaming
Structured Streaming, which provides stream processing on top of Spark Core and the SQL API, is one of the most important components of Apache Spark™. In this blog post, we summarize the notable improvements to Structured Streaming in the latest 3.1 release, including a new streaming table API, …
A Guide to Data + AI Summit Sessions: Machine Learning, Data Engineering, Apache Spark and More
We are only a few weeks away from Data + AI Summit, returning May 24–28. If you haven’t signed up yet, take advantage of free registration for five days of virtual engagement: training, talks, meetups, AMAs and community camaraderie. To help you navigate through hundreds of sessions, I am sharing …
How (Not) to Tune Your Model with Hyperopt
Hyperopt is a powerful tool for tuning ML models with Apache Spark. Read on to learn how to define and execute (and debug) the tuning optimally! So, you want to build a model. You’ve solved the harder problems of accessing data, cleaning it and selecting features. Now, you just need …
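Hyperopt's `fmin` automates exactly this loop: objective in, best hyperparameters out, with smarter (TPE) sampling than shown here. As a dependency-free sketch of the idea — the quadratic "loss" and the [0, 10] search range are toy assumptions — tuning reduces to sampling candidates, scoring them, and keeping the best:

```python
# Dependency-free sketch of the loop Hyperopt's fmin automates: sample
# hyperparameters, score them with an objective, keep the best. The quadratic
# "loss" and the [0, 10] search range are toy assumptions for illustration.
import random

def objective(x: float) -> float:
    """Toy validation loss, minimized at x = 3."""
    return (x - 3.0) ** 2

rng = random.Random(0)                      # seeded for reproducibility
candidates = (rng.uniform(0.0, 10.0) for _ in range(50))
trials = [(x, objective(x)) for x in candidates]
best_x, best_loss = min(trials, key=lambda t: t[1])
```

Hyperopt replaces the uniform sampler with an adaptive one, so later trials concentrate near promising regions instead of searching blindly.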
Efficiently Building ML Models for Predictive Maintenance in the Oil and Gas Industry With Databricks
Guest authored post by Halliburton’s Varun Tyagi, Data Scientist, and Daili Zhang, Principal Data Scientist, as part of the Databricks Guest Blog Program. Halliburton is an oil field services company with a 100-year proven track record of best-in-class oilfield offerings. With operations in over 70 countries, Halliburton provides services related …
Fine-Grained Time Series Forecasting at Scale With Facebook Prophet and Apache Spark: Updated for Spark 3
Advances in time series forecasting are enabling retailers to generate more reliable demand forecasts. The challenge now is to produce these forecasts in a timely manner and at a level of granularity that allows the business to make precise adjustments to product inventories. Leveraging Apache Spark™ and Facebook Prophet, more …
On-Demand Spark Clusters With GPU Acceleration
Apache Spark has become the de facto standard for processing large amounts of stationary and streaming data in a distributed fashion. The addition of the MLlib library, consisting of common learning algorithms and utilities, opened up Spark for a wide range of machine learning tasks and paved the way for running …
Advertising Fraud Detection at Scale at T-Mobile
This is a guest authored post by Data Scientist Eric Yatskowitz and Data Engineer Phan Chuong, T-Mobile Marketing Solutions. The world of online advertising is a large and complex ecosystem rife with fraudulent activity such as spoofing, ad stacking and click injection. Estimates show that digital advertisers lost about $23 …
Segmentation in the Age of Personalization
Quick link to the notebooks referenced throughout this post. Personalization is heralded as the gold standard of customer engagement. Organizations successfully personalizing their digital experiences are cited as driving 5 to 15% higher revenues and 10 to 30% greater returns on their marketing spend. And now many customer experience leaders are …
Glow V1.0.0, Next-Generation Genome-Wide Analytics
Genomics data has exploded in recent years, especially as some datasets, such as the UK Biobank, become freely available to researchers anywhere. Genomics data is leveraged for high-impact use cases: gene discovery, research and development prioritization, and randomized controlled trials. These use cases will help in developing …