Tag: Spark

What’s New in Apache Spark™ 3.1 Release for Structured Streaming
Providing stream processing on top of Spark Core and the SQL API, Structured Streaming is one of the most important components of Apache Spark™. In this blog post, we summarize the notable improvements to Structured Streaming in the latest 3.1 release, including a new streaming table API, …
A Guide to Data + AI Summit Sessions: Machine Learning, Data Engineering, Apache Spark and More
We are only a few weeks away from Data + AI Summit, returning May 24–28. If you haven’t signed up yet, take advantage of free registration for five days of virtual engagement: training, talks, meetups, AMAs and community camaraderie. To help you navigate through hundreds of sessions, I am sharing …
How (Not) to Tune Your Model with Hyperopt
Hyperopt is a powerful tool for tuning ML models with Apache Spark. Read on to learn how to define, execute, and debug tuning runs effectively! So, you want to build a model. You’ve solved the harder problems of accessing data, cleaning it and selecting features. Now, you just need …
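Hyperopt's core loop — minimize an objective function over a search space, one trial at a time — can be sketched without the library itself. Below is a stdlib stand-in for `hyperopt.fmin` with a uniform search space (the objective, bounds, and parameter name `x` are all hypothetical; in the post's setting the objective would train a model and return its validation loss, and `SparkTrials` would run trials in parallel):

```python
import random

def objective(params):
    """Toy loss surface standing in for a model's validation loss:
    minimized at x = 3."""
    x = params["x"]
    return (x - 3.0) ** 2

def random_search(objective, low, high, n_trials, seed=0):
    """Minimal stand-in for hyperopt's fmin over hp.uniform(low, high):
    sample the space uniformly and keep the best-scoring trial."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {"x": rng.uniform(low, high)}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best, loss = random_search(objective, 0.0, 10.0, n_trials=200)
```

Hyperopt improves on this by sampling adaptively (TPE) rather than uniformly, but the contract — a callable that maps a parameter dict to a loss — is the same.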
Efficiently Building ML Models for Predictive Maintenance in the Oil and Gas Industry With Databricks
Guest authored post by Halliburton’s Varun Tyagi, Data Scientist, and Daili Zhang, Principal Data Scientist, as part of the Databricks Guest Blog Program Halliburton is an oil field services company with a 100-year-long proven track record of best-in-class oilfield offerings. With operations in over 70 countries, Halliburton provides services related …
Fine-Grained Time Series Forecasting at Scale With Facebook Prophet and Apache Spark: Updated for Spark 3
Advances in time series forecasting are enabling retailers to generate more reliable demand forecasts. The challenge now is to produce these forecasts in a timely manner and at a level of granularity that allows the business to make precise adjustments to product inventories. Leveraging Apache Spark™ and Facebook Prophet, more …
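The pattern behind "fine-grained forecasting at scale" is to fit one independent model per store-item group rather than one global model. A minimal stdlib sketch of that grouping step, with a per-group mean standing in for Prophet (the data, store/item names, and the naive forecaster are all hypothetical; the post does this per group in parallel on Spark):

```python
from collections import defaultdict

# Toy sales records: (store, item, day, units), with a weekly bump.
records = [
    ("s1", "a", d, 10 + (d % 7 == 5) * 4) for d in range(28)
] + [
    ("s2", "a", d, 20 + (d % 7 == 5) * 6) for d in range(28)
]

def fit_per_group(records):
    """Group each history by (store, item), then 'fit' one forecaster per
    group -- here simply the mean of that group's history."""
    groups = defaultdict(list)
    for store, item, day, units in records:
        groups[(store, item)].append(units)
    return {key: sum(hist) / len(hist) for key, hist in groups.items()}

forecasts = fit_per_group(records)
```

Each group's model sees only its own history, which is what lets the approach scale out: groups are embarrassingly parallel.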
On-Demand Spark clusters with GPU acceleration
Apache Spark has become the de facto standard for processing large amounts of batch and streaming data in a distributed fashion. The addition of the MLlib library, consisting of common learning algorithms and utilities, opened up Spark for a wide range of machine learning tasks and paved the way for running …
Advertising Fraud Detection at Scale at T-Mobile
This is a guest authored post by Data Scientist Eric Yatskowitz and Data Engineer Phan Chuong, T-Mobile Marketing Solutions. The world of online advertising is a large and complex ecosystem rife with fraudulent activity such as spoofing, ad stacking and click injection. Estimates show that digital advertisers lost about $23 …
Segmentation in the Age of Personalization
Quick link to notebooks referenced through this post. Personalization is heralded as the gold standard of customer engagement. Organizations successfully personalizing their digital experiences are cited as driving 5 to 15% higher revenues and 10 to 30% greater returns on their marketing spend. And now many customer experience leaders are …
Glow v1.0.0: Next-Generation Genome-Wide Analytics
Genomics data has exploded in recent years, especially as some datasets, such as the UK Biobank, become freely available to researchers anywhere. Genomics data is leveraged for high-impact use cases: gene discovery, research and development prioritization, and randomized controlled trials. These use cases will help in developing …
Upgrade Production Workloads to Be Safer, Easier, and Faster With Databricks Runtime 7.3 LTS
What a difference a year makes. One year ago, Databricks Runtime (DBR) version 6.4 was released — followed by 8 more DBR releases. But now it’s time to plan for an upgrade to 7.3 for Long-Term Support (LTS) and compatibility, as support for DBR 6.4 will end on April 1, …
Analyzing Algorand Blockchain Data With Databricks Delta (Part 2)
This post was written in collaboration between Eric Gieseke, principal software engineer at Algorand, and Anindita Mahapatra, solutions architect, Databricks. Algorand is a public, decentralized blockchain system that uses a proof of stake consensus protocol. It is fast and energy efficient, with a transaction commit time under five seconds and …
Introducing Apache Spark™ 3.1
We are excited to announce the availability of Apache Spark 3.1 on Databricks as part of Databricks Runtime 8.0. We want to thank the Apache Spark™ community for all their valuable contributions to the Spark 3.1 release. Continuing with the objectives to make Spark faster, easier and smarter, Spark 3.1 …
Amplify Insights into Your Industry With Geospatial Analytics
Data science is becoming commonplace and most companies are leveraging analytics and business intelligence to help make data-driven business decisions. But are you supercharging your analytics and decision-making with geospatial data? Location intelligence, and specifically geospatial analytics, can help uncover important regional trends and behavior that impact your business. This …
Strategies for Modernizing Investment Data Platforms
The appetite for investment was at a historic high in 2020 for both individual and institutional investors. One study showed that “retail traders make up nearly 25% of the stock market following COVID-driven volatility”. Moreover, institutional investors have piled into cryptocurrency, with 36% now invested, as outlined …
Burning Through Electronic Health Records in Real Time With Smolder
In previous blogs, we looked at two separate workflows for working with patient data coming out of an electronic health record (EHR). In those workflows, we focused on a historical batch extract of EHR data. However, in the real world, data is continuously entered into an EHR. For many of …
How to Manage Python Dependencies in PySpark
Controlling an application’s environment is often challenging in distributed computing: it is difficult to ensure that all nodes have the desired environment in which to execute, and it can be tricky to know where the user’s code is actually running. Apache Spark™ provides several standard ways …
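One of the standard mechanisms the post covers is shipping dependencies to executors as a zip archive (`spark-submit --py-files deps.zip`, or `SparkContext.addPyFile`). This works because Python can import packages directly from a zip file placed on `sys.path`; the stdlib sketch below demonstrates that mechanism locally (the package name `mylib` and its contents are hypothetical):

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny package on disk...
workdir = tempfile.mkdtemp()
pkg_dir = os.path.join(workdir, "mylib")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("ANSWER = 42\n")

# ...zip it up, preserving the package-relative layout...
archive = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for root, _, files in os.walk(pkg_dir):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, workdir))

# ...and import straight from the archive, as executors do for --py-files.
sys.path.insert(0, archive)
mylib = importlib.import_module("mylib")
```

On a cluster, `spark.sparkContext.addPyFile("deps.zip")` distributes the archive and performs the equivalent path setup on every executor for you.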
Natively Query Your Delta Lake With Scala, Java, and Python
Today, we’re happy to announce that you can natively query your Delta Lake with Scala and Java (via the Delta Standalone Reader) and Python (via the Delta Rust API). Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, …
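What makes native readers like these possible is that a Delta table's state is just an ordered log of JSON actions under `_delta_log/`: a reader replays `add` and `remove` actions to recover the current file set. A deliberately simplified illustration of that replay (real log entries carry many more fields — stats, partition values, protocol versions — and the file names here are hypothetical):

```python
import json

# Two toy commits, each a newline-delimited list of JSON actions,
# mimicking the files Delta writes under _delta_log/.
commit_files = [
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}',
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
]

def active_files(commit_files):
    """Replay commits in order: an add makes a file live, a remove
    retires it. The result is the table's current snapshot."""
    live = set()
    for commit in commit_files:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

state = active_files(commit_files)
```

Because the log is ordinary JSON plus Parquet, any language with those two libraries can implement a reader — which is exactly what the Scala/Java and Rust readers announced here do.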
A Step-by-step Guide for Debugging Memory Leaks in Spark Applications
This is a guest authored post by Shivansh Srivastava, software engineer, Disney Streaming Services. Just a bit of context: we at Disney Streaming Services use Apache Spark across the business, and Spark Structured Streaming to develop our pipelines. These applications run on the Databricks Runtime (DBR) environment, which is quite user-friendly. …