(More) Statistics Without the Agonizing Pain: Probability Distributions

One of my favorite conference talks of all time is Statistics Without the Agonizing Pain by John Rauser, who was at that time head of data science at Pinterest. In this talk, he explains the statistical argument underpinning the Student's t-test in simple, approachable terms using an unforgettable example involving mosquitoes and beer. It's about 15 minutes long and well worth your time if you haven't watched it before.

After watching that video, I realized that—like most things—statistics is complex but ultimately straightforward once you understand the underlying ideas. The problem is that the modern approach to teaching statistics often gets in the way of that understanding. Historically, statistical methods were designed for a world where all computation had to be done by hand, so they were optimized to minimize calculation, not to maximize clarity or intuition. That design choice still has value today—efficient algorithms make modern statistical programs fast and practical. But we continue to teach statistics as if computation were still the bottleneck, even though we now all carry supercomputers in our pockets. Seen in that light, it’s obvious that the way we teach statistics has not kept up with the way we practice statistics.

To that end, I thought I'd write down some things I've learned about statistics over the years in a way that I hope is clearer than the average statistics textbook: partly so I don't forget them, but also in the hope that they'll be useful to others, too.

Continue reading “(More) Statistics Without the Agonizing Pain: Probability Distributions”

Having Trouble with Bedrock Errors? Check Cross-Region Inference.

I just tried to use one of the Amazon Nova models in AWS, and I got the following error message:

User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/AWSReservedSSO_AccessLevelGoesHere_bec5d7da6a6db396/andy@example.com
Action: bedrock:InvokeModelWithResponseStream
On resource(s): arn:aws:bedrock:us-west-2::foundation-model/amazon.nova-micro-v1:0
Context: a service control policy explicitly denies the action

After some tooling around, I noticed that the error message listed us-west-2, even though I was working in us-east-2. This sent me down a little rabbit hole that ultimately ended at Cross-Region Inference for Amazon Bedrock. Fortunately, that article links to a nice solution.
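For context, cross-region inference routes requests through geography-prefixed inference profile IDs (e.g., us.amazon.nova-micro-v1:0 instead of the plain foundation model ID amazon.nova-micro-v1:0), which is why an error can mention a region you never selected. Here's a minimal sketch of that naming convention; the helper class and its abbreviated region table are my own illustration, not an AWS API:

```java
import java.util.Map;

public class InferenceProfiles {
    // Geographic prefixes used by Bedrock cross-region inference profile IDs.
    // Assumption: only a few example regions are listed here for illustration.
    private static final Map<String, String> GEO_PREFIX = Map.of(
            "us-east-1", "us",
            "us-east-2", "us",
            "us-west-2", "us",
            "eu-west-1", "eu",
            "eu-central-1", "eu",
            "ap-northeast-1", "apac");

    /** Builds the cross-region inference profile ID for a foundation model ID. */
    public static String inferenceProfileId(String region, String modelId) {
        String prefix = GEO_PREFIX.get(region);
        if (prefix == null)
            throw new IllegalArgumentException("Unknown region: " + region);
        return prefix + "." + modelId;
    }

    public static void main(String[] args) {
        // Requests from us-east-2 may be served by any US-geography region,
        // which is how us-west-2 can show up in an error message.
        System.out.println(inferenceProfileId("us-east-2", "amazon.nova-micro-v1:0"));
        // prints us.amazon.nova-micro-v1:0
    }
}
```

Passing the profile ID rather than the bare model ID is the usual fix, provided your service control policies allow Bedrock in every region of the geography.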

Continue reading “Having Trouble with Bedrock Errors? Check Cross-Region Inference.”

AI is Getting Really Useful for SQL

I’m using Google BigQuery to do some ETL, and have found OpenAI’s products to be enormously helpful for the task.

A new client recently asked for some assistance working with Sunshine Act data. Since I expect additional asks about this data set to come in over time, rather than fuss with the generic UI, I decided to load the entire dataset into BigQuery. ChatGPT's o3-mini-high model has generated schemas and ETL queries extremely well, accelerating my work by at least 2x.

Continue reading “AI is Getting Really Useful for SQL”

Introducing Rapier

Rapier is a code generation companion library for Google Dagger. It is designed to reduce boilerplate by generating Dagger modules for fetching configuration data from common sources.

If you’ve ever written Dagger code like this:

@Component(modules = {RapierExampleComponentEnvironmentVariableModule.class})
public interface ExampleComponent {
    @EnvironmentVariable(value = "TIMEOUT", defaultValue = "30000")
    public long getTimeout();
}

Then Rapier can help!

Continue reading “Introducing Rapier”

Jackson CSV Serialization and Deserialization from the Ground Up

While there are many examples of Jackson serialization to JSON, there are comparatively few resources covering Jackson serialization to CSV. What follows is a ground-up example of working with a TSV-formatted dataset: creating the model object, parsing CSV into Java objects with Jackson, writing Java objects back to CSV with Jackson, and finishing with a full round-trip serialization test.
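The full post walks through each step; as a taste, here's a minimal sketch of the round trip using the jackson-dataformat-csv module's CsvMapper and CsvSchema. The Person record is a stand-in I made up for illustration, not the post's actual model object:

```java
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;
import java.util.List;

public class TsvRoundTrip {
    // Hypothetical model object; the post builds its own from the real dataset.
    @JsonPropertyOrder({ "name", "age" })
    public record Person(String name, int age) {}

    private static final CsvMapper MAPPER = new CsvMapper();

    // A schema derived from the model class, with a header row and
    // tab-separated columns (i.e., TSV rather than CSV)
    private static final CsvSchema SCHEMA = MAPPER.schemaFor(Person.class)
            .withHeader()
            .withColumnSeparator('\t');

    /** Java objects -> TSV text. */
    public static String toTsv(List<Person> people) throws Exception {
        return MAPPER.writer(SCHEMA).writeValueAsString(people);
    }

    /** TSV text -> Java objects. */
    public static List<Person> fromTsv(String tsv) throws Exception {
        try (MappingIterator<Person> it =
                MAPPER.readerFor(Person.class).with(SCHEMA).readValues(tsv)) {
            return it.readAll();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Person> people = List.of(new Person("Ada", 36), new Person("Bob", 41));
        String tsv = toTsv(people);
        System.out.print(tsv);
        System.out.println(fromTsv(tsv).equals(people)); // round trip succeeds
    }
}
```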

Continue reading “Jackson CSV Serialization and Deserialization from the Ground Up”

Generating Java record classes with Jackson Annotations to map JSON using ChatGPT

There’s a lot of discussion about how to use ChatGPT to generate tests for code. Another interesting use case I’ve seen fairly little coverage of is generating DTOs from JSON. Here is an example with the prompt I’ve put together applied to JSON from the manifest of a Distributed Map Run.
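To give a sense of the target output, here's the kind of record such a prompt produces. The JSON shape and field names below are hypothetical stand-ins I chose for illustration; the real post applies the prompt to the actual manifest JSON:

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class RecordDtoExample {
    // A Java record DTO with Jackson annotations mapping snake_case JSON
    // field names onto camelCase record components
    public record User(@JsonProperty("user_name") String userName,
                       @JsonProperty("id") long id) {}

    public static void main(String[] args) throws Exception {
        String json = "{\"user_name\":\"andy\",\"id\":42}";
        User user = new ObjectMapper().readValue(json, User.class);
        System.out.println(user.userName() + " " + user.id()); // prints andy 42
    }
}
```

Records work well here because Jackson (2.12+) can deserialize directly through the canonical constructor, so no setters or builders are needed.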

Continue reading “Generating Java record classes with Jackson Annotations to map JSON using ChatGPT”

AWS SageMaker Object Detection Training Gotchas

As part of updates to arachn.io, I’ve started tinkering with object detection machine learning models. During my experiments on AWS SageMaker, I found that AutoPilot does not support object detection models, so I had to train using notebooks. As a result, I hit some “gotchas” fine-tuning TensorFlow Object Detection models. While this notebook works a treat on its own training data (at least when run through SageMaker Studio), this discussion will focus on things I learned while trying to run it on my own data on August 31, 2024.

Continue reading “AWS SageMaker Object Detection Training Gotchas”

Efficient Image Metadata Extraction with Java

Java has a rich set of tools for processing images built into the standard library. However, it’s not always clear how to use that library to perform even simple tasks. There are already lots of great guides out there for working with images once they’re loaded… but what can Java do without ever loading the image into memory at all?

When working with images from untrusted sources — for example, images discovered during a web crawl — it’s best to treat data defensively. This article will show how to perform some useful tasks on images without ever loading their pixel data into memory.
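The article develops this in detail; as a taste, here's a minimal, self-contained sketch using the standard library's ImageIO reader plumbing to extract an image's dimensions from its header without decoding pixel data. The class and method names are my own, not necessarily the article's:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class ImageMetadata {
    /**
     * Reads an image's width and height from its header without decoding
     * any pixel data into memory.
     */
    public static int[] dimensions(InputStream raw) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(raw)) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext())
                throw new IOException("Unrecognized image format");
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly=true, ignoreMetadata=true: read once,
                // front to back, skipping ancillary metadata
                reader.setInput(in, true, true);
                // getWidth/getHeight consult header fields only; the pixel
                // data is never decoded
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny PNG in memory so the example is self-contained
        BufferedImage img = new BufferedImage(64, 48, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);
        int[] wh = dimensions(new ByteArrayInputStream(out.toByteArray()));
        System.out.println(wh[0] + "x" + wh[1]); // prints 64x48
    }
}
```

This defensive approach means a maliciously huge or malformed image from a crawl can be rejected by its declared dimensions before you commit memory to decoding it.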

Continue reading “Efficient Image Metadata Extraction with Java”