A Cautionary Tale

Some of you may have seen that on Thursday, I had a bad day.

In the grand scheme of things it wasn’t that bad a day; it actually turned out pretty well. But what follows is a post mortem on how I accidentally pinged every Android customer in production, and a little bit of advice so that it doesn’t happen to you.

Background

Marketing wanted both Android and iOS to integrate a CRM library which supported push notifications. We historically haven’t really used this capability, and we want to be able to experiment with being able to target segments and do all sorts of clever things.

However, our Android app has never supported push notifications - in fact we’d brought this up plenty of times when discussing features, and it was one of the tasks on our backlog to gain full feature parity with our iOS colleagues. Consequently, I set out to work out the exact state of push notifications on the Cuvva Android app - a large codebase with a fair bit of legacy cruft still left in it (although we’ve been very successful in backing it into a corner).

On FCM

If you’ve not integrated push notifications via Firebase Cloud Messaging before, it’s very easy to set up. And it’s also simple to work out if your app supports this functionality already or not.

As we all know, the Android Manifest is a declaration of what your app is capable of. The Android system looks at this manifest and does all kinds of things - making notes of Intents that your app supports, instantiating Services, etc. If you have an FCM integration, you’ll find something like this:

<service
    android:name=".MessagingService"
    android:exported="false">
    <intent-filter>
        <action android:name="com.google.firebase.MESSAGING_EVENT" />
    </intent-filter>
</service>

It’s this com.google.firebase.MESSAGING_EVENT which is important, and it makes it trivially easy to work out conclusively whether or not a codebase has push support. However, this service could come from an SDK that your app consumes, in which case checking your :app manifest isn’t enough.

To be really sure what your app supports/declares, you need to check the merged manifest. The long story short is that when compiling your app, the Android toolchain merges the manifest files of each module and library together into one giant XML file, and sometimes you can find surprises in there. We’ve been bitten before by third-party libraries that include permissions that we don’t expect, only to realise later on when the Google Play Console warns us - more on how you can avoid this later. To check your merged manifest, head to your main :app module’s AndroidManifest.xml, and at the bottom of the window you’ll see two tabs - Text and Merged Manifest.
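If you'd rather script this check than click through the IDE, here is a minimal sketch that scans a merged manifest for any service listening for the FCM action. It uses only the JDK's XML parser. The function name is made up for illustration, and the location of the merged manifest under your build directory varies between Android Gradle Plugin versions, so the XML is passed in as a string.

```kotlin
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

const val ANDROID_NS = "http://schemas.android.com/apk/res/android"
const val FCM_ACTION = "com.google.firebase.MESSAGING_EVENT"

/**
 * Returns the android:name of every <service> in the manifest that has an
 * intent-filter action for FCM messages. (Hypothetical helper, not a real tool.)
 */
fun findMessagingServices(manifestXml: String): List<String> {
    val factory = DocumentBuilderFactory.newInstance().apply { isNamespaceAware = true }
    val document = factory.newDocumentBuilder().parse(manifestXml.byteInputStream())
    val services = document.getElementsByTagName("service")
    return (0 until services.length)
        .map { services.item(it) as Element }
        .filter { service ->
            // Look at every <action> nested inside this <service>
            val actions = service.getElementsByTagName("action")
            (0 until actions.length).any { i ->
                (actions.item(i) as Element).getAttributeNS(ANDROID_NS, "name") == FCM_ACTION
            }
        }
        .map { it.getAttributeNS(ANDROID_NS, "name") }
}
```

An empty list means nothing in that particular build is registered to receive FCM messages. It says nothing about any other build.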

It wasn’t failing to check this that got me - the merged manifest was clean. I even dumped the entire dependency tree to check for instances of Firebase Messaging or legacy GCM libraries using ./gradlew -q app:dependencies > deps.txt. I saw nothing. I grepped the app for usages of NotificationManager, NotificationCompat and the like. Clean.
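For what it's worth, the dependency check can be scripted too. The sketch below filters the text of deps.txt for the FCM artifact and the legacy GCM one. The two Maven coordinates are the ones I'd expect to matter; the function name and the trimming of the tree-drawing characters are my own assumptions about what is useful.

```kotlin
/** Maven coordinates that pull in push messaging: current FCM and legacy GCM. */
val PUSH_ARTIFACTS = listOf(
    "com.google.firebase:firebase-messaging",
    "com.google.android.gms:play-services-gcm",
)

/**
 * Returns the distinct lines of a Gradle dependency report that mention a
 * push artifact, with the leading tree characters stripped.
 */
fun findPushDependencies(report: String): List<String> =
    report.lineSequence()
        .filter { line -> PUSH_ARTIFACTS.any { it in line } }
        .map { it.trimStart(' ', '|', '+', '-', '\\') }
        .distinct()
        .toList()
```

Feed it the contents of deps.txt. The report lists transitive dependencies as well, so a library that quietly depends on messaging should show up here.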

Next up was checking the FCM console itself. Within the project, I could see that we had iOS configured, web too - no mention of Android. The FCM events history graph had zero messages sent this quarter. I went to GCP console - FCM wasn’t enabled for our GCP project, so clearly it hadn’t been integrated on the backend side of things either. Confirmed this with a Lead Engineer.

At this point I was 100% convinced that there was no way that our Android app could pick up a notification. So I opened the “Android Setup” tutorial and got to work.

Time to Test

FCM integration is trivially easy - just a class to extend, a Service to declare. Trickiest part was briefly forgetting that you can’t use constructor injection with a Service; you need to use field injection rather like an Activity entry point. You’re probably using Hilt now to make this simpler, but for dagger-android users:

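import com.google.firebase.messaging.FirebaseMessagingService
import com.google.firebase.messaging.RemoteMessage
import dagger.Module
import dagger.android.AndroidInjection
import dagger.android.ContributesAndroidInjector
import javax.inject.Inject

class MessagingService : FirebaseMessagingService() {

    // Services can't use constructor injection, so the helper is field-injected.
    @Inject
    internal lateinit var pushHelper: SomeHelper

    override fun onCreate() {
        // Inject before super.onCreate() so the fields are ready when the Service starts.
        AndroidInjection.inject(this)
        super.onCreate()
    }

    override fun onMessageReceived(remoteMessage: RemoteMessage) {
        // ...
    }
}

@Module
abstract class NotificationModule {

    // ServiceScoped is presumably an app-defined Dagger scope; dagger-android doesn't ship one.
    @ServiceScoped
    @ContributesAndroidInjector
    abstract fun provideMessagingService(): MessagingService
}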

Once happy with our Android integration, I looked at the FCM console and saw it: Send your first message.

What could go wrong, indeed

I entered a few details: a title, a message. I deleted my original message; it was too boring. I put in something that I thought was ironic and funny. My genius knows no bounds: shame no-one would ever know. I chose the user segment - Android - and FCM warned me that would affect 533k devices (in hindsight - this number didn’t match our analytics, so where did it come from? That’s right: registered push tokens). At this stage I did feel a little nervous, but I was absolutely certain that there was no way that users would see the message. So certain in fact that I displayed a stunning amount of hubris by Tweeting about my message, 30 seconds before sending.

Less than a minute later, I got my first reply, featuring a screenshot of my notification on their device.

Oh.

Oops

I wasn’t quite sure of the scale - after all, push notifications are best-effort and there are tonnes of reasons why they might not hit devices. However, my timeline, and that of Cuvva, immediately blew up. Our Slack channel for monitoring mentions of Cuvva exploded. We were inundated with screenshots of users receiving my test notification. It made it to /r/programmerHumor in less than 20 seconds. I was mortified.

The Impact

All-in-all, around 120,000 push notifications were received, and about 20,000 of these were opened. This was quite a good conversion rate! But aside from that, the impact was… pretty positive?

Our customers seemed to find it hilariously funny, and we saw the most social media engagement that we’ve had in ages. My colleagues also found it very entertaining - my colleague and iOS Lead, Matt, now uses my hubristic Tweet as his meeting background. I got a fair amount of well-deserved ribbing.

But overall - this incident was about as good as a major production incident can be.

The Cause

I thought I’d been pretty diligent about looking for possible notification services, so I couldn’t understand what had happened. There was one breadcrumb though - one of the many people who tweeted us with screenshots had a notification icon that wasn’t ours. It was the default Android one. This to me implied that:

  • We definitely hadn’t built this functionality ourselves and forgotten about it
  • It was probably some library that we were consuming
  • This library hadn’t built the service particularly well, perhaps including it accidentally

So my immediate thought was to double check that no third-party SDK had snuck something in. I didn’t see anything though, so I did a bit of an audit. My main suspects for a while were various analytics or marketing tools, but all of their documentation suggested that they did push support properly. All-in-all I must have spent a couple of hours scratching my head. I was sure it must have been an errant library.

And then I realised. We’d recently removed a big lib.

I checked out a tag from two weeks ago which still contained my suspect. I looked in the merged manifest - and there it was: a completely standard, unextended FCM service.

But it wasn’t conclusive. Normally you can click on an entry in this merged manifest and Android Studio will take you to the library which declared this entry. It’s an incredibly useful feature, but presumably because this FCM service hadn’t been extended, clicking through just took me to FCM itself. I dug out ./gradlew -q app:dependencies once again, grepped for messaging, and there it was. This library had, accidentally or otherwise, included the default FCM service which was posting all received messages as notifications. I won’t name and shame this library as it’s fairly specialist, but I can guarantee that you’re not using it.

The library in question was being evaluated for months for a big upcoming feature and we had dropped it two weeks ago. Version 4.1.2 of the app had it, the two releases after (and develop) did not. But minor issues in our rollouts meant that the majority of users were still on the 4.1.2 train, and therefore most users got notified.

On FCM in Libraries

If you’re a library maintainer and you happen to need to sneak in a push notification service, please make sure that inbound messages are relevant to you before notifying a user:

class MessagingService : FirebaseMessagingService() {

    override fun onMessageReceived(remoteMessage: RemoteMessage) {
        if (isMyLibrarysProblem(remoteMessage)) {
            showNotification(...)
        }
    }
}

Even better - don’t sneak in this functionality, and abstract out a helper so that consumers are able to delegate to you if required. Intercom does this, and it’s the correct way to handle this stuff.
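To make the delegation idea concrete, here is a rough sketch of the shape such a helper could take. Everything in it is hypothetical: the PushDelegate interface, the "sender" and "body" payload keys and the dispatch function are invented for illustration and aren't Intercom's API or anyone else's. It works on the plain string map that RemoteMessage.getData() gives you, so the host app keeps ownership of the one and only messaging service.

```kotlin
/** Hypothetical contract a library could expose instead of registering its own service. */
interface PushDelegate {
    /** True only when the payload carries this library's own marker. */
    fun owns(data: Map<String, String>): Boolean

    /** Handles a payload that owns() has already claimed. */
    fun handle(data: Map<String, String>)
}

/** Example library-side implementation; the "sender" and "body" keys are made up. */
class ExampleSdkPushDelegate(private val show: (String) -> Unit) : PushDelegate {
    override fun owns(data: Map<String, String>) = data["sender"] == "example-sdk"
    override fun handle(data: Map<String, String>) = show(data["body"].orEmpty())
}

/**
 * Meant to be called from the host app's onMessageReceived with remoteMessage.data.
 * Returns true if a delegate claimed the message, false if the app should deal with it.
 */
fun dispatch(data: Map<String, String>, delegates: List<PushDelegate>): Boolean {
    val owner = delegates.firstOrNull { it.owns(data) } ?: return false
    owner.handle(data)
    return true
}
```

The point of the design is that a message nobody claims falls through to the app, rather than being posted blindly by whichever library got there first.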

Learnings

There’s plenty to take away from this incident about how to prevent the same thing happening to you:

Your Build Isn’t Necessarily Your Customer’s Build

What really messed me up here was making assumptions about the state of our app in the wild. Do not do this. Check develop but check out your most popular tagged release too, and give it a quick sanity-check.

Carefully Vet Libraries

When adding large third-party libraries such as CRM, analytics or telematics packages, make sure that they don’t include anything that you don’t want. Add the dependency and then inspect your merged manifest carefully for anything that you don’t expect.

Automate PRs

You can automate this auditing too.

I mentioned earlier that we’d had an issue previously with a library (the same library, go figure) merging in permissions that we didn’t expect. There’s a tool which can catch this via danger.systems, and you can find it here. All it does is compare the permissions in your APK to a list that you previously generated, and then fails the build if something was added. Obviously this wouldn’t have helped in this instance, but it would be pretty easy to base a library on top of this code to provide the same, err, service, for Services. I will look into this and publish it if someone doesn’t beat me to it.
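As a very rough sketch of what the Services version of that check might look like (this is not the permissions tool, just the same compare-against-a-baseline idea), the function below lists every service in a merged manifest that isn't in an approved set. The function name and the idea of keeping the baseline as a set of fully qualified class names are assumptions on my part.

```kotlin
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

const val ANDROID_NAMESPACE = "http://schemas.android.com/apk/res/android"

/**
 * Returns the names of services declared in the merged manifest that are
 * missing from the approved baseline, sorted for stable output.
 */
fun unexpectedServices(mergedManifestXml: String, baseline: Set<String>): List<String> {
    val factory = DocumentBuilderFactory.newInstance().apply { isNamespaceAware = true }
    val document = factory.newDocumentBuilder().parse(mergedManifestXml.byteInputStream())
    val services = document.getElementsByTagName("service")
    return (0 until services.length)
        .map { (services.item(it) as Element).getAttributeNS(ANDROID_NAMESPACE, "name") }
        .filter { it !in baseline }
        .sorted()
}
```

A CI step could fail the build whenever the returned list isn't empty, which forces someone to look at any new service before it ships.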

Use Non-Prod

Even if you’re 100% certain that you’re safe when integrating notifications, it’s better to assume that you aren’t. At some point you’ll need to test something while your previous integration was live, so why not start with a non-prod FCM environment anyway? Either create a new project in the Firebase console and associate it with a test, UAT (User Acceptance Testing) build or something similar (good), or just add multiple versions of the app to your prod environment and target that segment (less good). Then send push notifications to your heart’s content, safe in the knowledge that you couldn’t possibly be notifying customers. Just make sure that you’re in the right project before you start pressing send.
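For the separate-project route, a product flavour is one way to wire it up. The fragment below is a sketch for a Gradle Kotlin DSL build file. The flavour names, the dimension name and the ID suffix are all placeholders, and the flavorDimensions syntax shown is the newer style, so older Android Gradle Plugin versions will need it written differently.

```kotlin
// app/build.gradle.kts (fragment, names are placeholders)
android {
    flavorDimensions += "environment"
    productFlavors {
        create("prod") {
            dimension = "environment"
        }
        create("uat") {
            dimension = "environment"
            // A distinct application ID lets the test build register as its own app
            applicationIdSuffix = ".uat"
        }
    }
}
```

The google-services plugin can pick up a flavour-specific config file, so the test Firebase project's google-services.json would go in app/src/uat/ and the production one in app/src/prod/.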

Use Filtering

If you have push notifications set up already, consider adding some filtering to messages before notifying users. Perhaps include an environment in the message payload, compare to the current build and decide whether or not to notify from there.
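Here is a minimal sketch of that filter, written against the string map that RemoteMessage.getData() returns so that it can be unit tested without a device. The "environment" key and the decision to drop untagged messages are my assumptions, not anything FCM defines.

```kotlin
/** Payload key carrying the target environment; the key name is an assumption. */
const val ENVIRONMENT_KEY = "environment"

/**
 * Decides whether this build should show a message.
 * Messages with no environment tag are dropped, so an untagged test stays silent.
 */
fun shouldNotify(data: Map<String, String>, buildEnvironment: String): Boolean =
    data[ENVIRONMENT_KEY]?.equals(buildEnvironment, ignoreCase = true) ?: false
```

You'd call it at the top of onMessageReceived, passing something like a build-config constant as buildEnvironment. One caveat: this only sees messages that actually reach onMessageReceived. FCM notification messages that arrive while the app is in the background are displayed by the SDK itself, so this kind of filter suits data messages.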

Write a Good Message

Even with multiple safeguards, assume that you will screw up at some point and that someone will see it. Do not write something you wouldn’t want customers to see. If you can pretend the message is from marketing, even better. A simple “Happy Thursday!” won’t be given a second thought by users. “Test notification” looks bad, “Is this fucking thing working lolol” might just endanger your job.

On the Android Community

One of the loveliest bits of fallout from all of this was the messages of support from my colleagues in the Android space. Sending a test notification to production is a rite of passage for mobile engineers, and this happened to be my time.

Loads of people shared brilliant stories about how they’d broken production, notified users, wiped databases clean or seen people nuke their career with one broadcasted expletive. It was heartwarming, funny, and to each and every one of you who reached out: thank you.

Wrapping Up

  • Do not test notifications in production
  • Ensure that you have a different build flavour which is hooked up to a separate Firebase project so that you can test notifications. Either that, or ensure that your existing integration has some kind of filtering logic to ensure that it only posts notifications which are relevant - filter out debug ones.
  • Do not test notifications in production
  • Carefully vet large third-party SDKs - particularly CRM/Marketing/Analytics packages.
  • Do not test notifications in production
  • Your app is not necessarily the customer’s app - bear this in mind before you do anything with production, as what you’re seeing on develop doesn’t necessarily reflect what’s out in the wild.
  • Do not test notifications in production
  • If you have to test notifications, use a message you can pretend that you meant to send.

And finally: Do. Not. Test. Notifications. In. Production.
