All the data you need.

Tag: Computing

Why not reuse passwords?
Perhaps you’ve heard that you should not reuse passwords but don’t understand why. After all, you have a really good password, one that nobody would ever guess, so why not use it everywhere? Your password is not as good as you think. First of all, people who think they have …
Unix linguistics
If you knew that you wanted to learn 10 spoken languages, it would probably be helpful to take a course in linguistics first. Or maybe to have a linguistics course after learning your first or second language. And if the languages are related, it would help to know something about …
Randomized response and local differential privacy
Differential privacy protects user privacy by adding randomness as necessary to the results of queries to a database containing private data. Local differential privacy protects user privacy by adding randomness before the data is inserted into the database. Using the visualization from this post, local differential privacy takes the left …
PATE framework for differentially private machine learning
Machine learning models can memorize fragments of their training data and return these fragments verbatim. I’ve seen instances, for example, where I believe an LLM returned phrases verbatim from this site. It’s easy to imagine how medical data might leak this way. How might you prevent this? And how might …
Fax machines in the 21st century
Tens of millions of fax machines still exist. My business line gets calls from modems and fax machines fairly often. Maybe my number is close to that of a fax machine. Fax machines are especially common in health care. I remember when I was working at MD …
Portable sed -i across MacOS and Linux
The -i flag to ask sed to edit a file in place works differently on Linux and MacOS. If you want to create a backup of your file before you edit it, say with the extension .bak, then on Linux you would run sed -i.bak myfile but for the version …
Elliptic curve addition formulas
The geometric description of addition of points P and Q on an elliptic curve involves four logical branches: If one of P or Q is the point at infinity … Else if P = Q … Else if P and Q lie on a vertical line … Else … It …
Elliptic curve Diffie-Hellman key exchange
I concluded the previous post by saying elliptic curve Diffie-Hellman key exchange (ECDHE) requires smaller keys than finite field Diffie-Hellman (FFDHE) to obtain the same level of security. How much smaller are we talking about? According to NIST recommendations, a 256-bit elliptic curve provides about the same security as …
Chinese Remainder Theorem synthesis algorithm
Suppose m = pq where p and q are large, distinct primes. In the previous post we said that calculations mod m can often be carried out more efficiently by working mod p and mod q, then combining the results to get back to a result mod m. The Chinese …
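As a rough illustration of the "work mod p and mod q, then combine" idea, here is a minimal sketch of one standard way to do the combining step (often called Garner's formula). It is not necessarily the algorithm the post goes on to describe; the function name `crt_combine` and the small primes 11 and 13 are assumptions chosen only for illustration.

```python
def crt_combine(a_p, a_q, p, q):
    """Return the x in [0, p*q) with x = a_p (mod p) and x = a_q (mod q).

    Assumes p and q are distinct primes, 0 <= a_p < p, and 0 <= a_q < q.
    """
    q_inv = pow(q, -1, p)            # inverse of q mod p (Python 3.8+)
    h = ((a_p - a_q) * q_inv) % p    # correction term, computed mod p
    return a_q + q * h


# Illustrative (hypothetical) small primes; real uses involve large primes.
p, q = 11, 13
m = p * q

# Compute 7^5 mod m by working mod p and mod q separately, then combining.
result = crt_combine(pow(7, 5, p), pow(7, 5, q), p, q)
direct = pow(7, 5, m)
```

Here `result` and `direct` agree. The two small exponentiations stand in for the cheaper calculations mod p and mod q mentioned above.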
RSA encrypted messages that cannot be decrypted
Not all messages encrypted with the RSA algorithm can be decrypted. This post will show why this is possible and why it does not matter in practice. RSA in a nutshell: RSA encryption starts by finding two large primes, p and q. These primes are kept secret, but their product …
Checksum polynomials
A large class of checksum algorithms have the following pattern: Think of the bits in a file as the coefficients in a polynomial P(x). Divide P(x) by a fixed polynomial Q(x) mod 2 and keep the remainder. Report the remainder as a sequence of bits. In practice there’s a little …
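The pattern above can be sketched as polynomial long division with coefficients mod 2, where subtraction becomes XOR. This is a bare-bones illustration only: the function name `poly_mod2_remainder` and the example polynomials are assumptions, and it leaves out the extra details that practical checksum algorithms add.

```python
def poly_mod2_remainder(bits, divisor):
    """Remainder of P(x) divided by Q(x) with coefficients mod 2.

    Both arguments are lists of 0/1 coefficients, highest degree first.
    Assumes divisor[0] == 1 and len(bits) >= len(divisor) >= 2.
    """
    work = list(bits)                 # copy so the input is not modified
    n = len(divisor)
    for i in range(len(work) - n + 1):
        if work[i] == 1:              # leading term present: subtract Q(x)
            for j in range(n):
                work[i + j] ^= divisor[j]   # subtraction mod 2 is XOR
    return work[-(n - 1):]            # the low-order n - 1 coefficients


# P(x) = x^6 + x^5 + x^3 + x + 1 and Q(x) = x^3 + x + 1 (illustrative choices)
remainder = poly_mod2_remainder([1, 1, 0, 1, 0, 1, 1], [1, 0, 1, 1])
# remainder is [0, 1, 0], i.e. the polynomial x
```

The remainder always has one fewer coefficient than Q(x), which is why the degree of Q(x) fixes the length of the checksum.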
Angles between words
Natural language processing represents words as high-dimensional vectors, on the order of 100 dimensions. For example, the glove-wiki-gigaword-50 set of word vectors contains 50-dimensional vectors, and the glove-wiki-gigaword-200 set of word vectors contains 200-dimensional vectors. The intent is to represent words in such a way that the angle between …
Productive constraints
This post will discuss two scripting languages, but that’s not what the post is really about. It’s really about expressiveness and (or versus) productivity. *** I was excited to discover the awk programming language sometime in college because I had not used a scripting language before. Compared to C, awk …
Sort and remove duplicates
A common idiom in command line processing of text files is ... | sort | uniq | ... Some process produces lines of text. You want to pipe that text through sort to sort the lines in alphabetical order, then pass it to uniq to filter out all but the …
Swish, mish, and serf
Swish, mish, and serf are neural net activation functions. The names are fun to say, but more importantly the functions have been shown to improve neural network performance by solving the “dying ReLU problem.” Softplus can also be used as an activation function, but our interest in softplus here is …
Generating and inspecting an RSA private key
In principle you generate an RSA key by finding two large prime numbers, p and q, and computing n = pq. You could, for example, generate random numbers by rolling dice, then type the numbers into Mathematica to test each for primality until you find a couple of prime numbers of …
Date sequence from the command line
I was looking back at Jeroen Janssen’s book Data Science at the Command Line and his dseq utility caught my eye. This utility prints out a sequence of dates relative to the current date. I’ve needed this and didn’t know it. Suppose you have a CSV file and you need …
Named entity recognition
Named entity recognition (NER) is a task of natural language processing: pull named things out of text. It sounds trivial at first. Just create a giant list of named things and compare against that. But suppose, for example, University of Texas is on your list. If Texas is also on …