All the data you need.

Tag: Statistics

A Bayesian approach to proving you’re human
I set up a GitHub account for a new employee this morning and spent a ridiculous amount of time proving that I’m human. The captcha was to listen to three audio clips at a time and say which one contains bird sounds. This is a really clever test, because humans …
National identity stereotypes through generative AI
For Rest of World, Victoria Turk breaks down bias in generative AI in… Tags: AI, bias, midjourney, Rest of World
Flipbook Experiment, like the Telephone game but visual
This looks fun. The Pudding is running an experiment that functions like a… Tags: loss, Pudding, Russell Samora, sketch
Language-based AI to chat with her dead husband
For the past few years, Laurie Anderson has been using an AI chatbot… Tags: AI, chatbot, Guardian, Large Language Model
Estimating an author’s vocabulary
How would you estimate the size of an author’s vocabulary? Suppose you have analyzed the author’s available works and found n words, x of which are unique. Then you know the author’s vocabulary was at least x, but it’s reasonable to assume that the author may have known words …
Detecting the language of encrypted text
Imagine you are a code breaker living a century ago. You’ve intercepted a message, and you go through your bag of tricks, starting with the simplest techniques first. Maybe the message has been encrypted using a simple substitution cipher, so you start with that. Simple substitution ciphers can be broken …
Uncovering names masked with stars
Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.” Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male …
Love: math or magic?
This American Life tells the tales as old as time: When it comes… Tags: love, This American Life
When is less data less private?
If I give you a database, I give you every row in the database. So if you delete some rows from the database, you have less information, not more, right? This seems very simple, and it mostly is, but there are a couple subtleties. A common measure in data privacy …
How likely is a random variable to be far from its center?
There are many answers to the question in the title: How likely is a random variable to be far from its center? The answers depend on how much you’re willing to assume about your random variable. The more you can assume, the stronger your conclusion. The answers also depend on …
Two-digit zip codes
It’s common to truncate US zip codes to the first three digits for privacy reasons. Truncating to the first two digits is less common, but occurs in some data sets. HIPAA Safe Harbor requires sparse 3-digit zip codes to be suppressed; even when rolled up to three digits some regions …
DNA face to facial recognition in attempt to find suspect
In an effort to find a suspect in a 1990 murder, there was… Tags: crime, DNA, ethics, facial recognition, privacy, Wired
Beta inequality symmetries
I was thinking about the work I did when I worked in biostatistics at MD Anderson. This work was practical rather than mathematically elegant, useful in its time but not of long-term interest. However, one result came out of this work that I would call elegant, and that was a …
Coin flips might tend towards the same side they started
The classic coin flip is treated as a fair way to make decisions,… Tags: bias, coins, František Bartoš, Scientific American
The Five Safes data privacy framework
The Five Safes decision framework was created a couple decades ago by Felix Ritchie at the UK Office for National Statistics. It is a framework for evaluating the safe use of confidential data, particularly by government agencies. You can find a description of the Five Safes, for example, in NIST …
AI-based things in 2023
There were many AI-based things in 2023. Simon Willison outlined what we learned… Tags: AI, Large Language Model, Simon Willison
Estimating the size of YouTube
YouTube doesn’t offer numbers for how big they are, so Ethan Zuckerman and… Tags: estimation, Ethan Zuckerman, Jason Baumgartner, YouTube
Database reconstruction attacks
In 2018, three researchers from the US Census Bureau published a paper entitled “Understanding Database Reconstruction Attacks on Public Data.” [1] The article showed that private data on many individuals could be reverse engineered from public data. As I wrote about a few days ago, census blocks are at the …