All the data you need.

Tag: Spark

Data Quality Monitoring on Streaming Data Using Spark Streaming and Delta Lake
Try this notebook to reproduce the steps outlined below. In the era of accelerating everything, streaming data is no longer an outlier; instead, it is becoming the norm. We often no longer hear customers ask, “can I stream this data?” so much as “how fast can I stream this data?”, …
Introducing Databricks Ingest: Easy and Efficient Data Ingestion from Different Sources into Delta Lake
We are excited to introduce a new feature – Auto Loader – and a set of partner integrations, in public preview, that allow Databricks users to incrementally ingest data into Delta Lake from a variety of data sources. Auto Loader is an optimized cloud file source for Apache Spark …
On-Demand Webinar: Granular Demand Forecasting At Scale
We recently hosted a live webinar, How Starbucks Forecasts Demand at Scale with Facebook Prophet and Databricks. During this webinar we learned why demand forecasting is critical to retail/CPG firms and how it enables 22 other use cases. Brendan O’Shaughnessy, Data Science Manager at Starbucks, walked us …
Free eBook: A Practical Introduction to Apache Spark
If you are a developer or data scientist interested in big data, Spark is the tool for you. Apache Spark’s ability to speed analytic applications by orders of magnitude, its versatility, and ease of use are quickly winning the market. With Spark’s appeal to developers, end-users, and integrators to solve …
Building Reliable Data Pipelines for Machine Learning Webinar Recap
This is a guest blog from Ryan Fox Squire | Product & Data Science at SafeGraph. At SafeGraph we are big fans of Databricks. We use Databricks every day for ad hoc analysis, prototyping, and many of our production pipelines. SafeGraph is a data company – we sell accurate and …
Automating Digital Pathology Image Analysis with Machine Learning on Databricks
With technological advancements in imaging and the availability of new efficient computational tools, digital pathology has taken center stage in both research and diagnostic settings. Whole Slide Imaging (WSI) has been at the center of this transformation, enabling us to rapidly digitize pathology slides into high resolution images. By making …
Query Delta Lake Tables from Presto and Athena, Improved Operations Concurrency, and Merge performance
We are excited to announce the release of Delta Lake 0.5.0, which introduces Presto/Athena support and improved concurrency. The key features in this release are: Support for other processing engines using manifest files (#76) – You can now query Delta tables from Presto and Amazon Athena using manifest files, which …
On-Demand Webinar: Geospatial Analytics and AI in the Public Sector
We recently hosted a live webinar, Geospatial Analytics and AI in Public Sector, during which we covered top geospatial analysis use cases in the public sector, along with live demos showcasing how to build scalable analytics and machine learning pipelines on geospatial data at scale. Geospatial Analytics Webinar …
Fine-Grained Time Series Forecasting At Scale With Facebook Prophet And Apache Spark
Advances in time series forecasting are enabling retailers to generate more reliable demand forecasts. The challenge now is to produce these forecasts in a timely manner and at a level of granularity that allows the business to make precise adjustments to product inventories. Leveraging Apache Spark™ and Facebook Prophet, more …
Solving the World’s Toughest Problems with the Growing Open Source Ecosystem and Databricks
We started Databricks in 2013 in a tiny office in Berkeley with the belief that data has the potential to solve the world’s toughest problems. We entered 2020 as a global organization with over 1,000 employees and a customer base spanning from two-person startups to Fortune 10s. In this …
Better Machine Learning through Active Learning
Try this notebook to reproduce the steps outlined below. Machine learning models can seem like magical savants. They can distinguish hot dogs from not-hot-dogs, but that has long been an easy trick. My aunt’s parrot can do that too. But machine-learned models power voice-activated assistants that effortlessly understand noisy human speech, …
Processing Geospatial Data at Scale With Databricks
The evolution and convergence of technology has fueled a vibrant marketplace for timely and accurate geospatial data. Every day billions of handheld and IoT devices along with thousands of airborne and satellite remote sensing platforms generate hundreds of exabytes of location-aware data. This boom of geospatial big data combined with …
Streamlining Variant Normalization on Large Genomic Datasets with Glow
Cross posted from the Glow blog. Many research and drug development projects in the genomics world involve large genomic variant data sets, the volume of which has been growing exponentially over the past decade. However, the tools to extract, transform, load (ETL) and analyze these data sets have not kept …
Migration from Hadoop to modern cloud platforms: The case for Hadoop alternatives
Companies rely on their big data and analytics platforms to support innovation and digital transformation strategies. However, many Hadoop users struggle with complexity, unscalable infrastructure, excessive maintenance overhead, and, overall, unrealized value. We help customers navigate their Hadoop migrations to modern cloud platforms such as Databricks and our partner products …
Deep Learning Tutorial Demonstrates How to Simplify Distributed Deep Learning Model Inference Using Delta Lake and Apache Spark™
On October 10th, our team hosted a live webinar—Simple Distributed Deep Learning Model Inference—with Xiangrui Meng, Software Engineer at Databricks. Model inference, unlike model training, is usually embarrassingly parallel and hence simple to distribute. However, in practice, complex data scenarios and compute infrastructure often make this “simple” task hard to …
Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify and Automate Loan Default Predictions
In the post Using AutoML Toolkit to Automate Loan Default Predictions, we showed how the Databricks Labs’ AutoML Toolkit simplified machine learning feature engineering and model building optimization (MBO). It also improved the area under the curve (AUC) from 0.6732 (handmade XGBoost model) to 0.723 (AutoML XGBoost model). With …
Celebrating Growth at Databricks and 1,000 Employees!
This November, Databricks hired our 1,000th full-time employee! Founded in Berkeley in 2013, our six co-founders created Databricks to help data teams solve the world’s toughest problems – and since then, we’ve grown tremendously! Not only have we had some major milestones like our Microsoft partnership, resulting in Azure Databricks …
Scalable near real-time S3 access logging analytics with Apache Spark and Delta Lake
The original blog is from Viacheslav Inozemtsev, Senior Data Engineer at Zalando, reproduced with permission. Many organizations use AWS S3 as their main storage infrastructure for their data. Moreover, by using Apache Spark™ on Databricks, they often perform transformations of that data and save the refined results back to …