Tag: Code

Density-Based Clustering
Original content by Manojit Nandi. Updated by Josh Poduska. Cluster analysis is an important problem in data analysis. Data scientists use clustering to identify malfunctioning servers, group genes with similar expression patterns, and support various other applications. There are many families of data clustering algorithms, and you may be …
Bringing ML to Agriculture: Transforming a Millennia-old Industry
Guest post by Jeff Melching, Distinguished Engineer / Chief Architect Data & Analytics. At The Climate Corporation, we aim to help farmers better understand their operations and make better decisions to increase their crop yields in a sustainable way. We’ve developed a model-driven software platform, called Climate FieldView™, that captures, …
The Curse of Dimensionality
Guest post by Bill Shannon, Founder and Managing Partner of BioRankings. Danger of Big Data: Big data is the rage. This could be lots of rows (samples) and few columns (variables) like credit card transaction data, or lots of columns (variables) and few rows (samples) like genomic sequencing in life …
Providing fine-grained, trusted access to enterprise datasets with Okera and Domino
Domino and Okera provide data scientists with access to trusted datasets within reproducible and instantly provisioned computational environments. In the last few years, we’ve seen the acceleration of two trends: the increasing amounts of data stored and utilized by organizations, and the subsequent need for data scientists to help …
The importance of structure, coding style, and refactoring in notebooks
Notebooks are increasingly crucial in the data scientist’s toolbox. Although considered relatively new, their history traces back to systems like Mathematica and MATLAB. This form of interactive workflow was introduced to assist data scientists in documenting their work, facilitating reproducibility, and prompting collaboration with their team members. Recently there has …
Evaluating Ray: Distributed Python for Massive Scalability
Dean Wampler provides a distilled overview of Ray, an open source system for scaling Python systems from single machines to large clusters. If you are interested in additional insights, register for the upcoming Ray Summit. Introduction: This post is for people making technology decisions, by which I mean data science …
Data Drift Detection for Image Classifiers
This article covers how to detect data drift for models that ingest image data as their input in order to prevent their silent degradation in production. Run the example in a complementary Domino project. Introduction: preventing silent model degradation in production. In the real world, data is recorded by different …
Techniques for Collecting, Prepping, and Plotting Data: Predicting Social Media-Influence in the NBA
This article provides insight on the mindset, approach, and tools to consider when solving a real-world ML problem. It covers questions to consider as well as collecting, prepping, and plotting data. A complementary Domino project is available. Introduction: Collecting and prepping data are core research tasks. While the most ideal …
Clustering in R
This article covers clustering, including K-means and hierarchical clustering. A complementary Domino project is available. Introduction: Clustering is a machine learning technique that enables researchers and data scientists to partition and segment data. Segmenting data into appropriate groups is a core task when conducting exploratory analysis. As Domino seeks to …
Time Series with R
This article delves into methods for analyzing multivariate and univariate time series data. A complementary Domino project is available. Introduction: Conducting exploratory analysis and extracting meaningful insights from data are core components of research and data science work. Time series data is commonly encountered. We see it when working with …
Exploring US Real Estate Values with Python
This post covers data exploration using machine learning and interactive plotting. If you are interested in running the examples, a complementary Domino project is available. Introduction: Models are at the heart of data science. Data exploration is vital to model development and is particularly important at the start of any data …
Natural Language in Python using spaCy: An Introduction
This article provides a brief introduction to natural language using spaCy and related libraries in Python. The complementary Domino project is also available. Introduction: This article and paired Domino project provide a brief introduction to working with natural language (sometimes called “text analytics”) in Python using spaCy and related libraries. …
HyperOpt: Bayesian Hyperparameter Optimization
This article covers how to perform hyperparameter optimization using a sequential model-based optimization (SMBO) technique implemented in the HyperOpt Python package. There is a complementary Domino project available. Introduction: Feature engineering and hyperparameter optimization are two important model building steps. Over the years, I have debated with many colleagues as …
Deep Reinforcement Learning
This article provides an excerpt, “Deep Reinforcement Learning,” from the book Deep Learning Illustrated by Krohn, Beyleveld, and Bassens. The article includes an overview of reinforcement learning theory with a focus on deep Q-learning. It also covers using Keras to construct a deep Q-learning network that learns within a simulated …
Towards Predictive Accuracy: Tuning Hyperparameters and Pipelines
This article provides an excerpt of “Tuning Hyperparameters and Pipelines” from the book Machine Learning with Python for Everyone by Mark E. Fenner. The excerpt and complementary Domino project evaluate hyperparameter search methods, including GridSearch and RandomizedSearch, and build an automated ML workflow. Introduction: Data scientists, machine learning (ML) researchers, …
Deep Learning Illustrated: Building Natural Language Processing Models
Many thanks to Addison-Wesley Professional for permission to excerpt “Natural Language Processing” from the book Deep Learning Illustrated by Krohn, Beyleveld, and Bassens. The excerpt covers how to create word vectors and use them as input to a deep learning model. A complementary Domino project is available. …
Manual Feature Engineering
Many thanks to AWP Pearson for permission to excerpt “Manual Feature Engineering: Manipulating Data for Fun and Profit” from the book Machine Learning with Python for Everyone by Mark E. Fenner. There is also a complementary Domino project available. Introduction: Many data scientists deliver value to their organizations by …
A Practitioner’s Guide to Deep Learning with Ludwig
Joshua Poduska provides a distilled overview of Ludwig, including when to use Ludwig’s command-line syntax and when to use its Python API. Introduction: New tools are constantly being added to the deep learning ecosystem. It can be fun and informative to look for trends in the types of tools being …