Tag: Practical Techniques

The importance of structure, coding style, and refactoring in notebooks
Notebooks are increasingly crucial in the data scientist’s toolbox. Although considered relatively new, their history traces back to systems like Mathematica and MATLAB. This form of interactive workflow was introduced to assist data scientists in documenting their work, facilitating reproducibility, and promoting collaboration with their team members. Recently there has …
Data Drift Detection for Image Classifiers
This article covers how to detect data drift for models that ingest image data as their input in order to prevent their silent degradation in production. Run the example in a complementary Domino project. Introduction: preventing silent model degradation in production In the real world, data is recorded by different …
Model Interpretability: The Conversation Continues
This Domino Data Science Field Note covers a proposed definition of interpretability and a distilled overview of the PDR framework. Insights are drawn from Bin Yu, W. James Murdoch, Chandan Singh, Karl Kumbier, and Reza Abbasi-Asl’s recent paper, “Definitions, methods, and applications in interpretable machine learning”. Introduction Model interpretability continues to …
On Being Model-driven: Metrics and Monitoring
This article covers a couple of key Machine Learning (ML) vital signs to consider when tracking ML models in production to ensure model reliability, consistency and performance in the future. Many thanks to Don Miner for collaborating with Domino on this article. For additional vital signs and insight beyond what …
Clustering in R
This article covers clustering including K-means and hierarchical clustering. A complementary Domino project is available. Introduction Clustering is a machine learning technique that enables researchers and data scientists to partition and segment data. Segmenting data into appropriate groups is a core task when conducting exploratory analysis. As Domino seeks to …
Understanding Causal Inference
This article covers causal relationships and includes a chapter excerpt from the book Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications by Andrew Kelleher and Adam Kelleher. A complementary Domino project is available. Introduction As data science work is experimental and probabilistic in nature, data scientists …
Time Series with R
This article delves into methods for analyzing multivariate and univariate time series data. A complementary Domino project is available. Introduction Conducting exploratory analysis and extracting meaningful insights from data are core components of research and data science work. Time series data is commonly encountered. We see it when working with …
Exploring US Real Estate Values with Python
This post covers data exploration using machine learning and interactive plotting. If you are interested in running the examples, a complementary Domino project is available. Introduction Models are at the heart of data science. Data exploration is vital to model development and is particularly important at the start of any data …
Natural Language in Python using spaCy: An Introduction
This article provides a brief introduction to natural language using spaCy and related libraries in Python. The complementary Domino project is also available. Introduction This article and paired Domino project provide a brief introduction to working with natural language (sometimes called “text analytics”) in Python using spaCy and related libraries. …
HyperOpt: Bayesian Hyperparameter Optimization
This article covers how to perform hyperparameter optimization using a sequential model-based optimization (SMBO) technique implemented in the HyperOpt Python package. There is a complementary Domino project available. Introduction Feature engineering and hyperparameter optimization are two important model building steps. Over the years, I have debated with many colleagues as …
Deep Reinforcement Learning
This article provides an excerpt, “Deep Reinforcement Learning”, from the book Deep Learning Illustrated by Krohn, Beyleveld, and Bassens. The article includes an overview of reinforcement learning theory with a focus on deep Q-learning. It also covers using Keras to construct a deep Q-learning network that learns within a simulated …
Towards Predictive Accuracy: Tuning Hyperparameters and Pipelines
This article provides an excerpt of “Tuning Hyperparameters and Pipelines” from the book Machine Learning with Python for Everyone by Mark E. Fenner. The excerpt and complementary Domino project cover evaluating hyperparameters with GridSearch and RandomizedSearch as well as building an automated ML workflow. Introduction Data scientists, machine learning (ML) researchers, …
Deep Learning Illustrated: Building Natural Language Processing Models
Many thanks to Addison-Wesley Professional for providing permission to excerpt “Natural Language Processing” from the book Deep Learning Illustrated by Krohn, Beyleveld, and Bassens. The excerpt covers how to create word vectors and use them as input to a deep learning model. A complementary Domino project is available. …
A Practitioner’s Guide to Deep Learning with Ludwig
Joshua Poduska provides a distilled overview of Ludwig including when to use Ludwig’s command-line syntax and when to use its Python API. Introduction New tools are constantly being added to the deep learning ecosystem. It can be fun and informative to look for trends in the type of tools being …
Themes and Conferences per Pacoid, Episode 11
Paco Nathan’s latest article covers program synthesis, AutoPandas, model-driven data queries, and more. Introduction Welcome back to our monthly burst of themespotting and conference summaries. BTW, videos for Rev2 are up: https://rev.dominodatalab.com/rev-2019/ On deck this time ’round the Moon: program synthesis. In other words, using metadata about data science work …
MNIST Expanded: 50,000 New Samples Added
This post provides a distilled overview of the rediscovery of 50,000 samples within the MNIST dataset. MNIST: The Potential Danger of Overfitting Recently, Chhavi Yadav (NYU) and Leon Bottou (Facebook AI Research and NYU) indicated in their paper, “Cold Case: The Lost MNIST Digits”, how they reconstructed the MNIST (Modified …
Addressing Irreproducibility in the Wild
This Domino Data Science Field Note provides highlights and excerpted slides from Chloe Mawer’s “The Ingredients of a Reproducible Machine Learning Model” talk at a recent WiMLDS meetup. Mawer is a Principal Data Scientist at Lineage Logistics as well as an Adjunct Lecturer at Northwestern University. Special thanks to Mawer …
Can Data Science Help Us Make Sense of the Mueller Report?
This blog post provides insights on how to apply Natural Language Processing (NLP) techniques. A complementary Domino project is available. The Mueller Report The Mueller Report, officially known as the Report on the Investigation into Russian Interference in the 2016 Presidential Election, was recently released and gives the public more …