
Tag: Spark

Modernizing Risk Management Part 2: Aggregations, Backtesting at Scale and Introducing Alternative Data
Understanding and mitigating risk is a top priority for any financial services institution. However, as previously discussed in the first blog of this two-part series, banks today are still struggling to keep up with the emerging risks and threats facing their business. Plagued by the limitations of on-premises infrastructure and …
Customer Lifetime Value Part 1: Estimating Customer Lifetimes
The biggest challenge every marketer faces is how best to spend money to profitably grow their brand. We want to spend our marketing dollars on activities that attract the best customers, while avoiding spending on unprofitable customers or on activities that erode brand equity. Too often, marketers …
Monitor Your Databricks Workspace with Audit Logs
Cloud computing has fundamentally changed how companies operate – users are no longer subject to the restrictions of on-premises hardware deployments such as physical limits of resources and onerous environment upgrade processes. With the convenience and flexibility of cloud services come challenges in properly monitoring how your users …
How to access S3 data from Spark
Getting data from an AWS S3 bucket is as easy as configuring your Spark cluster. So you’ve decided you want to start writing a Spark job to process data. You’ve got your cluster created on AWS, Spark installed on those instances, and you’ve even identified what data you want to use …
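The setup described above amounts to wiring S3 credentials into Spark's Hadoop configuration. A minimal sketch, assuming an AWS account and a running Spark installation; the bucket name, path, and credential placeholders are illustrative, not from the post:

```python
# Configuration sketch: reading S3 data with PySpark via the s3a connector.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-read-example")
    # Credentials are placeholders; prefer instance profiles or env vars in practice.
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
    .getOrCreate()
)

# s3a:// is the actively maintained S3 filesystem scheme for Hadoop/Spark.
df = spark.read.csv("s3a://my-bucket/path/to/data.csv", header=True)
df.show(5)
```

In production, IAM roles attached to the EC2 instances are generally preferable to embedding keys in the Spark configuration.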
Vectorized R I/O in Upcoming Apache Spark 3.0
R is one of the most popular programming languages in data science, specifically dedicated to statistical analysis, with a number of extensions, such as RStudio add-ins and other R packages, for data processing and machine learning tasks. Moreover, it enables data scientists to easily visualize their data sets. By using …
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
This is a joint engineering effort between the Databricks Apache Spark engineering team — Wenchen Fan, Herman van Hovell and MaryAnn Xue — and the Intel engineering team — Ke Jia, Haifeng Chen and Carson Wang. See the AQE notebook to demo the solution covered below. Over the years, there’s …
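In Spark 3.0, Adaptive Query Execution is controlled by a handful of SQL configuration flags. A configuration sketch, assuming an active `SparkSession` bound to `spark`:

```python
# Turn on Adaptive Query Execution in Spark 3.0 (off by default in 3.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Optional AQE sub-features: coalesce small shuffle partitions at runtime,
# and split skewed join partitions into smaller tasks.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

With these set, Spark re-optimizes query plans between stages using runtime shuffle statistics rather than relying solely on pre-execution estimates.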
Modernizing Risk Management Part 1: Streaming data-ingestion, rapid model development and Monte-Carlo Simulations at Scale
Managing risk within the financial services, especially within the banking sector, has increased in complexity over the past several years. First, new frameworks (such as FRTB) are being introduced that potentially require tremendous computing power and an ability to analyze years of historical data. At the same time, regulators are demanding …
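The Monte Carlo value-at-risk idea behind the post can be illustrated on a single machine before scaling it out. A minimal sketch, not the blog's distributed implementation; the mean, volatility, and sample count are assumed parameters:

```python
import random

# Illustrative single-machine Monte Carlo VaR: simulate one-day portfolio
# returns, then read off the loss not exceeded at 95% confidence.
random.seed(42)
mu, sigma = 0.0005, 0.02  # assumed daily mean return and volatility
simulated = sorted(random.gauss(mu, sigma) for _ in range(100_000))

# 95% one-day VaR is the (negated) 5th percentile of simulated returns,
# roughly mu - 1.645 * sigma for a normal model.
var_95 = -simulated[int(0.05 * len(simulated))]
```

At scale, the same simulation loop is sharded across a cluster, with each executor generating an independent slice of the scenarios.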
Getting started with Spark and batch processing frameworks
What you need to know before diving into big data processing with Apache Spark and other frameworks. When I was an Insight Data Engineering Fellow in 2016, I knew very little about Apache Spark prior to starting the program. Worse, documentation seemed sparse and …
New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark™ 3.0
Pandas user-defined functions (UDFs) are one of the most significant enhancements in Apache Spark for data science. They bring many benefits, such as enabling users to use Pandas APIs and improving performance. However, Pandas UDFs have evolved organically over time, which has led to some inconsistencies and is creating confusion …
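The cleanup in Spark 3.0 lets the pandas UDF type be inferred from ordinary Python type hints instead of an explicit `PandasUDFType` argument. A minimal sketch; the function name and column are illustrative:

```python
import pandas as pd

# With Series -> Series type hints, Spark 3.0 infers a scalar pandas UDF:
# the function receives and returns vectorized pandas batches.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# On a cluster you would register and apply it like so (sketch):
# from pyspark.sql.functions import pandas_udf
# plus_one_udf = pandas_udf(plus_one, returnType="double")
# df.select(plus_one_udf("value"))
```

Because the plain function is just pandas-in, pandas-out, it can be unit-tested locally without a Spark cluster.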
Manage and Scale Machine Learning Models for IoT Devices
A common data science internet of things (IoT) use case involves training machine learning models on real-time data coming from an army of IoT sensors. Some use cases demand that each connected device has its own individual model since many basic machine learning algorithms often outperform a single complex model. …
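The one-model-per-device pattern typically maps each device's data to its own training function. A hypothetical sketch: the grouping column, a simple least-squares slope fit standing in for any real model, and the schema string are all illustrative assumptions, not the post's code:

```python
import pandas as pd

# Hypothetical per-device training step: fit a least-squares slope for one
# device's readings (a stand-in for any scikit-learn style model fit).
def train_model(pdf: pd.DataFrame) -> pd.DataFrame:
    x, y = pdf["reading"], pdf["target"]
    slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    return pd.DataFrame({"device_id": [pdf["device_id"].iloc[0]],
                         "slope": [slope]})

# On Spark, one model is trained per device group (sketch):
# df.groupBy("device_id").applyInPandas(
#     train_model, schema="device_id string, slope double")
```

Each group's pandas DataFrame fits in one executor's memory, so the per-device fit runs as ordinary single-node code while Spark parallelizes across devices.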
Schema Evolution in Merge Operations and Operational Metrics in Delta Lake
Try this notebook to reproduce the steps outlined below We recently announced the release of Delta Lake 0.6.0, which introduces schema evolution in merge operations, merge performance improvements, and operational metrics in table history. The key features in this release are: Support for schema evolution in merge operations (#170) – You …
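Schema evolution in merge is opt-in via a session configuration. A configuration sketch, assuming an active `SparkSession` bound to `spark` and an existing Delta table handle; the table and DataFrame names are placeholders:

```python
# Enable automatic schema evolution for merge (Delta Lake 0.6.0+).
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# A merge whose source contains new columns will now add them to the
# target table's schema instead of failing (sketch; names are placeholders):
# deltaTable.alias("t").merge(
#     updatesDF.alias("s"), "t.id = s.id"
# ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```

Without the flag, a merge against a source with unexpected columns raises a schema-mismatch error, which is often the safer default in production pipelines.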
Fighting Cyber Threats in the Public Sector with Scalable Analytics and AI
Watch our on-demand webinar Real-time Threat Detection, Analytics and AI in the Public Sector to learn more and see a live demo. In 2019, there were 7,098 data breaches exposing over 15.1 billion records. That equates to a cyber incident every hour and fifteen minutes. The Public Sector is a …
Now on Databricks: A Technical Preview of Databricks Runtime 7 Including a Preview of Apache Spark 3.0
Introducing Databricks Runtime 7.0 Beta We’re excited to announce that the Apache Spark 3.0.0-preview2 release is available on Databricks as part of our new Databricks Runtime 7.0 Beta. The 3.0.0-preview2 release is the culmination of tremendous contributions from the open-source community to deliver new capabilities, performance gains and expanded compatibility …
How to build a Quality of Service (QoS) analytics solution for streaming video services
An overview of building the solution, covering: the importance of quality to streaming video services; Databricks QoS solution overview; video QoS solution architecture; making your data ready for analytics; creating the dashboard / virtual network operations center; creating (near) real-time alerts; next steps with machine learning; and getting started with the Databricks streaming video solution. …
Faster SQL Queries on Delta Lake with Dynamic File Pruning
There are two time-honored optimization techniques for making queries run faster in data systems: process data at a faster rate or simply process less data by skipping non-relevant data. This blog post introduces Dynamic File Pruning (DFP), a new data-skipping technique enabled by default in Databricks Runtime 6.1, which can …
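Dynamic File Pruning is a runtime data-skipping feature rather than an API, so the only code a user typically touches is its toggle. A configuration sketch, assuming a Databricks Runtime 6.1+ cluster with an active `SparkSession` bound to `spark`; the flag name follows Databricks' documented setting:

```python
# DFP is on by default in Databricks Runtime 6.1+; the flag can be
# toggled to compare query plans with and without file pruning.
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning", "true")

# DFP helps most on selective joins against large Delta tables, e.g.:
# SELECT f.* FROM fact_sales f
# JOIN dim_item i ON f.item_id = i.item_id
# WHERE i.category = 'books'
# Files of fact_sales that cannot match the filtered dimension rows are
# skipped at runtime instead of being scanned.
```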
How a Fresh Approach to Safety Stock Analysis Can Optimize Inventory
Refer to the accompanying notebook for more details. A manufacturer is working on an order for a customer only to find that the delivery of a critical part is delayed by a supplier. A retailer experiences a spike in demand for beer due to an unforeseen event, and they lose …
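The scenarios above are what safety stock is meant to absorb. As a baseline for comparison, the standard textbook formula can be computed directly; this is a common starting point, not necessarily the exact method in the blog's notebook, and the parameter values below are illustrative:

```python
import math

# Standard safety-stock baseline:
#   safety_stock = z * sigma_demand * sqrt(lead_time)
# where z is the service-level factor, sigma_demand the std-dev of daily
# demand, and lead_time the replenishment lead time in days.
def safety_stock(z: float, sigma_demand: float, lead_time_days: float) -> float:
    return z * sigma_demand * math.sqrt(lead_time_days)

# e.g. 95% service level (z ~ 1.645), daily demand std-dev of 20 units,
# 9-day lead time: about 98.7 units of buffer inventory.
units = safety_stock(1.645, 20.0, 9.0)
```

More refined approaches also account for variability in the lead time itself, which is part of what a data-driven analysis can improve on.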
Spark + AI Summit is now a global virtual event
Extraordinary times call for extraordinary measures. That’s why we transformed this year’s Spark + AI Summit into a fully virtual experience and opened the doors to welcome everyone, free of charge. This gives us the opportunity to turn Summit into a truly global event, bringing together tens of thousands of …
COVID-19 Datasets Now Available on Databricks: How the Data Community Can Help
With the massive disruption of the current COVID-19 pandemic, many data engineers and data scientists are asking themselves “How can the data community help?” The data community is already doing some amazing work in a short amount of time including (but certainly not limited to) one of the most commonly …