All the data you need.

Tag: Spark

Identify Suspicious Behavior in Video with Databricks Runtime for Machine Learning
With the exponential growth of cameras and visual recordings, it is becoming increasingly important to operationalize and automate the process of video identification and categorization. Applications ranging from identifying the correct cat video to visually categorizing objects are becoming more prevalent. With millions of users around the world generating and …
Introducing Flint: A time-series library for Apache Spark
This is a joint guest community blog by Li Jin at Two Sigma and Kevin Rasmussen at Databricks; they share how to use Flint with Apache Spark. Introduction: The volume of data that data scientists face these days increases relentlessly, and we now find that a traditional, single-machine solution is …
An Introduction To Machine Learning Using Spark Language
Machine learning is an emerging field that lets you create algorithms that learn from data and make predictions based on the data collected. Machine learning can be practiced in various languages such as Python, Java, C++, R, etc. …
A Guide to Apache Spark Use Cases, Streaming, and Research Talks at Spark + AI Summit Europe
For much of Apache Spark’s history, its capacity to process data at scale and its capability to unify disparate workloads have led Spark developers to tackle new use cases. Through innovation and extension of its ecosystem, developers combine data and AI to develop new applications. So it befits developers to come …
By Customer Demand: Databricks and Snowflake Integration
Today, we are proud to announce a partnership between Snowflake and Databricks that will help our customers further unify Big Data and AI by providing an optimized, production-grade integration between Snowflake’s built-for-the-cloud data warehouse and Databricks’ Unified Analytics Platform. Over the course of the last year, our …
How to Use MLflow to Experiment a Keras Network Model: Binary Classification for Movie Reviews
In the last blog post, we demonstrated the ease with which you can get started with MLflow, an open-source platform to manage the machine learning lifecycle. In particular, we illustrated a simple Keras/TensorFlow model using MLflow and PyCharm. This time we explore a binary classification Keras network model. Using MLflow’s Tracking …
New Features in MLflow v0.5.0 Release
Today, we’re excited to announce MLflow v0.5.0, which we released last week with some new features. MLflow 0.5.0 is already available on PyPI and docs are updated. If you do pip install mlflow as described in the MLflow quickstart guide, you will get the recent release. In this post, we’ll …
Introducing mlflow-apps: A Repository of Sample Applications for MLflow
Introduction: This summer, I was a software engineering intern at Databricks on the Machine Learning (ML) Platform team. As part of my intern project, I built a set of MLflow apps that demonstrate MLflow’s capabilities and offer the community examples to learn from. In this blog, I’ll discuss this library …
100x Faster Bridge between Apache Spark and R with User-Defined Functions on Databricks
The SparkR User-Defined Function (UDF) API opens up opportunities for big data workloads running on Apache Spark to embrace R’s rich package ecosystem. Some of our customers that have R experts on board use the SparkR UDF API to blend R’s sophisticated packages into their ETL pipeline, applying transformations that go beyond …
Building a Real-Time Attribution Pipeline with Databricks Delta
In digital advertising, one of the most important things to be able to deliver to clients is information about how their advertising spend drove results. The more quickly we can provide this, the better. To tie conversions or engagements to the impressions served in an advertising campaign, companies must perform …
Loan Risk Analysis with XGBoost and Databricks Runtime for Machine Learning
For companies that make money from interest on loans held by their customers, it’s always about increasing the bottom line. Being able to assess the risk of loan applications can save a lender the cost of holding too many risky assets. It is the data scientist’s job to run …
MLflow 0.4.2 Released
Today, we’re excited to announce MLflow v0.4.0, v0.4.1, and v0.4.2, which we released within the last week with some of the recently requested features. MLflow 0.4.2 is already available on PyPI and docs are updated. If you do pip install mlflow as described in the MLflow quickstart guide, you …
A Guide to Data Science, Developer, and Deep Dive Talks at Spark + AI Summit Europe
In October 2012, Harvard Business Review put a spotlight on the data science career with a dedicated issue and a catchy claim: Data Scientist: The Sexiest Job of the 21st Century. Last year in October, five years on, Forbes recast an answer on Quora, Why Data Science Is Such A …
Get Certified on Apache Spark™ with Databricks
In a world of rapidly changing products, companies investing in technology need well-trained experts to run it. Certifications are a key differentiator in a competitive job market because they validate your skills and expertise while keeping you relevant. In fact, certifications may impact career growth more than degrees, since business …
Processing Petabytes of Data in Seconds with Databricks Delta
Introduction: Databricks Delta is a unified data management system that brings data reliability and fast analytics to cloud data lakes. In this blog post, we take a peek under the hood to examine what makes Databricks Delta capable of sifting through petabytes of data within seconds. In particular, we discuss …
rquery: Practical Big Data Transforms for R-Spark Users
This is a guest community blog from Nina Zumel and John Mount, data scientists and consultants at Win-Vector. They share how to use rquery with Apache Spark on Databricks. Introduction: In this blog, we will introduce rquery, a powerful query tool that allows R users to implement powerful data transformations …
Bay Area Apache Spark Meetup Summary @ Databricks HQ
On July 19, we held our monthly Bay Area Spark Meetup (BASM) at Databricks HQ in San Francisco. At the Spark + AI Summit in June, we announced two open-source projects: Project Hydrogen and MLflow. Partly to continue sharing the progress of these open-source projects with the community and partly …
MLflow v0.3.0 Released
Today, we’re excited to announce MLflow v0.3.0, which we released last week with some of the requested features from internal clients and open source users. MLflow 0.3.0 is already available on PyPI and docs are updated. If you do pip install mlflow as described in the MLflow quickstart guide, you …