
Tag: Practical Techniques

Data Exploration with Pandas Profiler and D-Tale
In this blog post we cover the use of Pandas Profiler and D-Tale for Exploratory Data Analysis.
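To give a flavor of the workflow, here is a minimal sketch of launching D-Tale on a toy DataFrame (the data is invented for illustration; a Pandas Profiling sketch appears under that post further down this page):

```python
import pandas as pd
import dtale

# Toy dataset, purely for illustration
df = pd.DataFrame({"age": [23, 35, 47, 29],
                   "income": [40000, 62000, 85000, 51000]})

# D-Tale spins up an interactive browser UI (filtering, charts,
# correlations) on top of the DataFrame
dtale.show(df)
```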
Building a Speaker Recognition Model
The ability of a system to recognize a person by their voice is a non-intrusive way to collect biometric information. Unlike fingerprint detection, retinal scans, or face recognition, speaker recognition uses just a microphone to record a person’s voice, thereby circumventing the need for expensive hardware. Moreover, in pandemic …
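As a hypothetical illustration of a common first step in such systems (not necessarily the post’s exact pipeline), here is a sketch of extracting MFCC voice features with librosa; the file path is a placeholder:

```python
import librosa
import numpy as np

# Load a voice recording (path is hypothetical); resample to 16 kHz,
# a common rate for speech processing
y, sr = librosa.load("speaker_sample.wav", sr=16000)

# Mel-frequency cepstral coefficients (MFCCs) are a standard voice "fingerprint"
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Average over time to get one fixed-length embedding per recording
embedding = np.mean(mfcc, axis=1)
print(embedding.shape)  # (13,)
```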
Fundamentals of Signal Processing
Basics of digital signal processing: A signal is defined as any physical quantity that varies with time, space, or any other independent variable. Most of the signals encountered in science and engineering are analog in nature; that is, they are functions of continuous variables, such as time or …
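A small numpy sketch (frequencies chosen arbitrarily) makes the analog-to-digital distinction concrete: sampling evaluates a continuous signal at discrete instants:

```python
import numpy as np

# An "analog" signal: a 5 Hz sine wave, defined for any continuous time t
f = 5.0
def analog(t):
    return np.sin(2 * np.pi * f * t)

# Sampling turns it into a digital signal: evaluate at instants t = n / fs
fs = 50.0             # sampling rate in Hz, comfortably above the 2*f Nyquist rate
n = np.arange(50)     # sample indices
x = analog(n / fs)    # discrete-time signal x[n]
print(x[:5])
```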
Accelerating model velocity through Snowflake Java UDF integration
Integrating Domino and Snowflake and using in-database machine learning and data processing techniques via user-defined functions (UDFs).
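As a hedged sketch of what this can look like from Python (connection parameters are placeholders and the UDF is a toy example, not the post’s code), an inline Java UDF can be registered and called through the Snowflake connector:

```python
import snowflake.connector

# Connect to Snowflake (credentials are placeholders)
conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()

# Register an inline Java UDF; function and handler names are illustrative
cur.execute("""
CREATE OR REPLACE FUNCTION double_it(x FLOAT)
RETURNS FLOAT
LANGUAGE JAVA
HANDLER = 'MyUdf.doubleIt'
AS $$
class MyUdf {
    public static double doubleIt(double x) {
        return 2 * x;
    }
}
$$
""")

# Call the UDF in-database, so the computation runs next to the data
cur.execute("SELECT double_it(21.0)")
print(cur.fetchone())  # (42.0,)
```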
ML internals: Synthetic Minority Oversampling (SMOTE) Technique
In this article we discuss why fitting models on imbalanced datasets is problematic and how class imbalance is typically addressed. We present the inner workings of the SMOTE algorithm and show a simple “from scratch” implementation of SMOTE. We use an artificially constructed imbalanced dataset (based on Iris) to generate …
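In the spirit of that from-scratch implementation, here is a compressed sketch of the core SMOTE step (simplified; the full algorithm oversamples to a target class ratio):

```python
import numpy as np

def smote_sample(X_minority, k=3, n_new=4, seed=0):
    """Generate synthetic minority samples by interpolating between neighbors (SMOTE)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # Find the k nearest minority neighbors of x (index 0 is x itself)
        d = np.linalg.norm(X_minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        # Interpolate a random fraction of the way toward a random neighbor
        x_nn = X_minority[rng.choice(neighbors)]
        synthetic.append(x + rng.random() * (x_nn - x))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.3], [0.9, 0.8], [1.3, 1.0]])
print(smote_sample(X_min))
```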
Credit Card Fraud Detection using XGBoost, SMOTE, and threshold moving
In this article, we’ll discuss the challenge organizations face around fraud detection and how machine learning can be used to spot anomalies that the human eye might not catch. We’ll use a gradient boosting technique via XGBoost to create a model and walk you through steps you can …
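The threshold-moving piece is easy to sketch (the data here is simulated, not the article’s dataset, and the 0.2 cutoff is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Simulated imbalanced "fraud" data: roughly 1% positive class
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)

# Threshold moving: score probabilities, then pick a cutoff below the
# default 0.5 to trade precision for recall on the rare fraud class
proba = model.predict_proba(X_te)[:, 1]
preds = (proba >= 0.2).astype(int)
print(preds.sum(), "transactions flagged as fraud")
```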
Ray for Data Science: Distributed Python tasks at scale
In this article, Dr. Dean Wampler provides an overview of Ray, including why we need it. The article covers practical techniques and walks through code to help users get started.
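The core Ray idea fits in a few lines; a minimal local sketch:

```python
import ray

ray.init()  # start Ray locally; on a cluster this would connect to it

@ray.remote
def square(x):
    # Each call runs as a distributed task, potentially on another machine
    return x * x

# Launch tasks in parallel and gather the results
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```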
Enterprise-class NLP with spaCy v3
Introducing the latest features in spaCy 3.0, including transformer-based pipelines that bring its NLP capabilities up to the state of the art.
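A minimal sketch of loading one of the v3 transformer pipelines (assumes the en_core_web_trf model has been downloaded):

```python
import spacy

# Transformer-based pipeline; requires `pip install spacy[transformers]`
# and `python -m spacy download en_core_web_trf`
nlp = spacy.load("en_core_web_trf")

doc = nlp("Domino Data Lab published a spaCy v3 walkthrough in 2021.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```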
How to supercharge data exploration with Pandas Profiling
Producing insights from raw data is a time-consuming process. Predictive modeling efforts rely on dataset profiles, whether consisting of summary statistics or descriptive charts. Pandas Profiling, an open-source tool built on Pandas DataFrames, can simplify and accelerate such tasks. This blog explores the challenges associated with doing …
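As a minimal sketch of the idea (the file names are placeholders), generating a full HTML profile takes two lines:

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("your_data.csv")  # hypothetical input file

# One line replaces much of the manual summary-statistics work;
# minimal=True skips the most expensive computations on large datasets
report = ProfileReport(df, title="Data Profile", minimal=True)
report.to_file("profile.html")
```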
Snowflake and Domino: Better Together
Introduction: Arming data science teams with the access and capabilities needed to establish a two-way flow of information is one critical challenge many organizations face when it comes to unlocking value from their modeling efforts. Part of this challenge is that many organizations seek to align their data science workflows …
PyCaret 2.2: Efficient Pipelines for Model Development
Data science is an exciting field, but it can be intimidating to get started, especially for those new to coding. Even for experienced developers and data scientists, the process of developing a model could involve stringing together many steps from many packages, in ways that might not be as elegant …
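A minimal sketch of how PyCaret compresses those steps (using a sample dataset shipped with the library; the session settings are illustrative):

```python
from pycaret.classification import setup, compare_models
from pycaret.datasets import get_data

# Built-in sample dataset bundled with PyCaret
data = get_data("juice")

# setup() wires the whole preprocessing pipeline in one step
exp = setup(data, target="Purchase", session_id=42, silent=True)

# Train and rank many candidate models with a single call
best = compare_models()
print(best)
```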
Faster data exploration in Jupyter through Lux
Notebooks have become one of the primary tools for many data scientists. They offer a clear way to collaborate with others throughout the process of data exploration, feature engineering and model fitting, and, by following some clear best practices, can also become living documents of how that code operates. …
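A minimal sketch (the file path and column names are placeholders) of how little code Lux requires inside a Jupyter notebook:

```python
import lux
import pandas as pd

df = pd.read_csv("your_data.csv")  # hypothetical input file

# In Jupyter, simply displaying the DataFrame now offers a toggle
# to a gallery of recommended visualizations
df

# Steer the recommendations by declaring which columns interest you
df.intent = ["col_a", "col_b"]  # placeholder column names
df
```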
Performing Non-Compartmental Analysis with Julia and Pumas AI
When analysing pharmacokinetic data to determine the degree of exposure of a drug and associated pharmacokinetic parameters (e.g., clearance, elimination half-life, maximum observed concentration (Cmax), and the time at which the maximum concentration was observed (Tmax)), Non-Compartmental Analysis (NCA) is usually the preferred approach [1]. At its core, NCA is based on applying …
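The post itself works in Julia with Pumas; purely to illustrate the core exposure metrics in this page’s predominant language, here is a Python sketch on invented concentration-time data:

```python
import numpy as np

# Concentration-time data from a single subject (illustrative values)
t = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0])   # hours
c = np.array([0.0, 1.8, 2.9, 2.4, 1.5, 0.6, 0.2])    # mg/L

# Core NCA quantities: exposure (AUC via the trapezoidal rule),
# maximum observed concentration (Cmax), and its time (Tmax)
auc = np.trapz(c, t)
cmax = c.max()
tmax = t[c.argmax()]
print(f"AUC={auc:.2f} mg*h/L, Cmax={cmax} mg/L, Tmax={tmax} h")
```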
Density-Based Clustering
Original content by Manojit Nandi, updated by Josh Poduska. Cluster analysis is an important problem in data analysis. Data scientists use clustering to identify malfunctioning servers, to group genes with similar expression patterns, and in many other applications. There are many families of data clustering algorithms, and you may be …
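A minimal sketch of density-based clustering with scikit-learn’s DBSCAN (the dataset and parameters are chosen purely for this toy shape):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a classic case where density-based
# clustering succeeds and centroid-based methods like k-means struggle
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the density threshold
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks points labeled as noise
```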
Analyzing Large P Small N Data – Examples from Microbiome
Guest Post by Bill Shannon, Founder and Managing Partner of BioRankings. Introduction: High-throughput screening technologies have been developed to measure all the molecules of interest in a sample in a single experiment (e.g., the entire genome, the amounts of metabolites, the composition of the microbiome). These technologies have been …
Bringing ML to Agriculture: Transforming a Millennia-old Industry
Guest post by Jeff Melching, Distinguished Engineer / Chief Architect Data & Analytics At The Climate Corporation, we aim to help farmers better understand their operations and make better decisions to increase their crop yields in a sustainable way. We’ve developed a model-driven software platform, called Climate FieldView™, that captures, …
The Curse of Dimensionality
Guest Post by Bill Shannon, Founder and Managing Partner of BioRankings. Danger of Big Data: Big data is all the rage. This could be lots of rows (samples) and few columns (variables), like credit card transaction data, or lots of columns (variables) and few rows (samples), like genomic sequencing in life …
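A small simulation (a sketch, not taken from the post) shows one face of the curse: in high dimensions, distances between random points concentrate, so the nearest and farthest neighbors become almost equally far away:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 200 random points per dimensionality and compare distances
# from the first point to all the others
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))
    dist = np.linalg.norm(X[0] - X[1:], axis=1)
    print(f"d={d:>4}: min/max distance ratio = {dist.min() / dist.max():.2f}")
```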
Providing fine-grained, trusted access to enterprise datasets with Okera and Domino
Domino and Okera provide data scientists access to trusted datasets within reproducible and instantly provisioned computational environments. In the last few years, we’ve seen the acceleration of two trends: the increasing amounts of data stored and utilized by organizations, and the subsequent need for data scientists to help …