Subreddit History Is a Surprisingly Good Lie Detector

Depending on your perspective, Reddit is either one of the last great online communities or a petri dish for the internet’s most cutting-edge bots. Or, maybe, both.

One thing that’s become increasingly hard to ignore is how much conversation on Reddit is shaped by accounts that don’t really read like people. The scale of the problem is obvious, and everyone sees it: bot networks, coordinated astroturfing campaigns, accounts for sale, and now AI slop. But the tools Reddit gives users to assess account credibility are thin at best. Reddit lets you click a username and see when an account was created and a karma score. That’s about it. And that’s not very useful when karma is literally for sale.

I built Reddit Contextualizer (Chrome, Firefox) to help users spot these accounts more easily.

Continue reading “Subreddit History Is a Surprisingly Good Lie Detector”

(More) Statistics Without the Agonizing Pain: Probability Distributions

One of my favorite conference talks of all time is Statistics Without the Agonizing Pain, by John Rauser, who was at that time head of data science at Pinterest. In this talk, he explains the statistical argument underpinning Student’s t-test in simple, approachable terms using an unforgettable example involving mosquitoes and beer. It’s about 15 minutes long and well worth your time if you haven’t watched it before.

After watching that video, I realized that—like most things—statistics is complex but ultimately straightforward once you understand the underlying ideas. The problem is that the modern approach to teaching statistics often gets in the way of that understanding. Historically, statistical methods were designed for a world where all computation had to be done by hand, so they were optimized to minimize calculation, not to maximize clarity or intuition. That design choice still has value today—efficient algorithms make modern statistical programs fast and practical. But we continue to teach statistics as if computation were still the bottleneck, even though we now all carry supercomputers in our pockets. Seen in that light, it’s obvious that the way we teach statistics has not kept up with the way we practice statistics.

To that end, I thought I’d write down some things I’ve learned about statistics over the years in a way that I hope is clearer than the average statistical textbook, mostly so I don’t forget them, but in the hopes that maybe they’ll be useful to others, too.
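The simulation-based approach from Rauser’s talk gives a taste of what “computation instead of closed forms” looks like in practice: pool two groups of measurements, repeatedly shuffle the labels, and count how often chance alone produces a difference in means as large as the one observed. Here is a minimal sketch of that permutation test; the data values are made up purely for illustration, not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PermutationTest {

    static double mean(List<Double> xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.size();
    }

    /**
     * One-sided permutation test: how often does a random relabeling of
     * the pooled data produce a difference in means at least as large as
     * the one actually observed?
     */
    static double pValue(List<Double> a, List<Double> b, int trials, long seed) {
        double observed = mean(a) - mean(b);

        // Pool all observations; under the null hypothesis, the group
        // labels carry no information, so any relabeling is equally likely.
        List<Double> pooled = new ArrayList<>(a);
        pooled.addAll(b);

        Random rng = new Random(seed);
        int atLeastAsExtreme = 0;
        for (int i = 0; i < trials; i++) {
            Collections.shuffle(pooled, rng);
            double diff = mean(pooled.subList(0, a.size()))
                    - mean(pooled.subList(a.size(), pooled.size()));
            if (diff >= observed) atLeastAsExtreme++;
        }
        return (double) atLeastAsExtreme / trials;
    }

    public static void main(String[] args) {
        // Hypothetical measurements for two groups (invented numbers).
        List<Double> groupA = List.of(27.0, 20.0, 21.0, 26.0, 27.0, 31.0, 24.0);
        List<Double> groupB = List.of(21.0, 22.0, 15.0, 12.0, 21.0, 16.0, 19.0);
        System.out.println("p = " + pValue(groupA, groupB, 100_000, 42L));
    }
}
```

No t-tables, no degrees of freedom, no distributional assumptions — just counting, which is exactly the kind of brute-force clarity that was unthinkable when these methods had to be run by hand.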

Continue reading “(More) Statistics Without the Agonizing Pain: Probability Distributions”

Having Trouble with Bedrock Errors? Check Cross-Region Inference.

I just tried to use one of the Amazon Nova models in AWS, and I got the following error message:

User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/AWSReservedSSO_AccessLevelGoesHere_bec5d7da6a6db396/andy@example.com
Action: bedrock:InvokeModelWithResponseStream
On resource(s): arn:aws:bedrock:us-west-2::foundation-model/amazon.nova-micro-v1:0
Context: a service control policy explicitly denies the action

After some tooling around, I noticed that the error message referenced us-west-2, even though I was working in us-east-2. That sent me down a little rabbit hole that ended at Cross-Region Inference for Amazon Bedrock. Fortunately, that article links to a nice solution.

Continue reading “Having Trouble with Bedrock Errors? Check Cross-Region Inference.”

AI is Getting Really Useful for SQL

I’m using Google BigQuery to do some ETL, and have found OpenAI’s products to be enormously helpful for the task.

A new client recently asked for some assistance working with Sunshine Act data. Since I expect additional asks about this data set to come in over time, rather than fuss with the generic UI, I decided to load the entire dataset into BigQuery. ChatGPT’s o3-mini-high model has generated schemas and ETL queries extremely well, accelerating my work by at least 2x.

Continue reading “AI is Getting Really Useful for SQL”

Introducing Rapier

Rapier is a code generation companion library for Google Dagger. It is designed to reduce boilerplate by generating Dagger modules for fetching configuration data from common sources.

If you’ve ever written Dagger code like this:

@Component(modules = {RapierExampleComponentEnvironmentVariableModule.class})
public interface ExampleComponent {
    @EnvironmentVariable(value = "TIMEOUT", defaultValue = "30000")
    public long getTimeout();
}

Then Rapier can help!

Continue reading “Introducing Rapier”

Jackson CSV Serialization and Deserialization from the Ground Up

While there are many examples of Jackson serialization to JSON, there are comparatively few resources on Jackson serialization to CSV. What follows is a ground-up example of working with a TSV-formatted dataset: creating the model object, parsing CSV into Java objects with Jackson, writing Java objects back out to CSV with Jackson, and finishing with a full round-trip serialization test.
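As a taste of the round trip, here is a minimal sketch using the jackson-dataformat-csv module. The Person record and its fields are invented for illustration; the key pieces are CsvMapper, a CsvSchema derived from the model class, and withColumnSeparator('\t') to switch from CSV to TSV.

```java
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;
import java.util.List;

public class TsvRoundTrip {

    // schemaFor() needs a fixed column order to build a stable schema.
    @JsonPropertyOrder({"name", "age"})
    public record Person(String name, int age) {}

    static final CsvMapper MAPPER = new CsvMapper();

    // withColumnSeparator('\t') is what turns "CSV" into TSV.
    static final CsvSchema SCHEMA = MAPPER.schemaFor(Person.class)
            .withHeader()
            .withColumnSeparator('\t');

    /** Serialize: Java objects -> TSV text (header row included). */
    static String write(List<Person> people) throws Exception {
        return MAPPER.writer(SCHEMA).writeValueAsString(people);
    }

    /** Deserialize: TSV text -> Java objects. */
    static List<Person> read(String tsv) throws Exception {
        MappingIterator<Person> rows =
                MAPPER.readerFor(Person.class).with(SCHEMA).readValues(tsv);
        return rows.readAll();
    }

    public static void main(String[] args) throws Exception {
        List<Person> people = List.of(new Person("Ada", 36), new Person("Alan", 41));
        String tsv = write(people);
        System.out.print(tsv);
        System.out.println("round trip ok: " + read(tsv).equals(people));
    }
}
```

This assumes the jackson-dataformat-csv dependency is on the classpath; the full post walks through each step in more detail.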

Continue reading “Jackson CSV Serialization and Deserialization from the Ground Up”

Generating Java record classes with Jackson Annotations to map JSON using ChatGPT

There’s a lot of discussion about how to use ChatGPT to generate tests for code. Another interesting use case I’ve seen fairly little coverage of is generating DTOs from JSON. Here is an example of the prompt I’ve put together, applied to JSON from the manifest of a Distributed Map Run.
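For a sense of the output, this is the general shape of record such a prompt produces. The JSON here is a simplified, from-memory approximation of a Distributed Map Run manifest (field names and values may not match a real manifest.json exactly), so treat the record as an illustration of the pattern rather than a faithful mapping.

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

public class ManifestExample {

    /** Maps a simplified, hypothetical manifest.json from a Distributed Map Run. */
    @JsonIgnoreProperties(ignoreUnknown = true)
    public record Manifest(
            @JsonProperty("DestinationBucket") String destinationBucket,
            @JsonProperty("MapRunArn") String mapRunArn,
            @JsonProperty("ResultFiles") ResultFiles resultFiles) {

        @JsonIgnoreProperties(ignoreUnknown = true)
        public record ResultFiles(
                @JsonProperty("FAILED") List<ResultFile> failed,
                @JsonProperty("PENDING") List<ResultFile> pending,
                @JsonProperty("SUCCEEDED") List<ResultFile> succeeded) {}

        @JsonIgnoreProperties(ignoreUnknown = true)
        public record ResultFile(
                @JsonProperty("Key") String key,
                @JsonProperty("Size") long size) {}
    }

    public static void main(String[] args) throws Exception {
        // Invented sample input, loosely modeled on a manifest file.
        String json = """
                {"DestinationBucket": "my-bucket",
                 "MapRunArn": "arn:aws:states:us-east-2:000000000000:mapRun/example",
                 "ResultFiles": {"FAILED": [], "PENDING": [],
                   "SUCCEEDED": [{"Key": "results/0.json", "Size": 1024}]}}
                """;
        Manifest m = new ObjectMapper().readValue(json, Manifest.class);
        System.out.println(m.resultFiles().succeeded().get(0).key());
    }
}
```

Immutable records plus @JsonProperty for the PascalCase field names and @JsonIgnoreProperties for forward compatibility is exactly the kind of boilerplate that’s tedious to write and trivial for the model to generate.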

Continue reading “Generating Java record classes with Jackson Annotations to map JSON using ChatGPT”

AWS SageMaker Object Detection Training Gotchas

As part of updates to arachn.io, I’ve started tinkering with object detection machine learning models. During my experiments on AWS SageMaker, I found that Autopilot does not support object detection models, so I had to train using notebooks instead. As a result, I hit some “gotchas” fine-tuning TensorFlow Object Detection models. While this notebook works a treat on its own training data (at least when run through SageMaker Studio), this discussion will focus on things I learned while trying to run it on my own data on August 31, 2024.

Continue reading “AWS SageMaker Object Detection Training Gotchas”