How (Not) to Return Data from Your API


10 months ago   •   9 min read

By Stephen Rees-Carter

The point of an API is to transport data between one system and another, and the more data you make available through an API, the more a consumer can do with it. Therefore, it makes logical sense to return absolutely everything from your API. You don’t need to worry about any annoying permissions or privacy considerations - let the consumer figure out what the user needs to see and ignore the rest. 

Job done, get paid, walk away!

Now, I could stop here and let you figure out why what I’ve just said is dangerous rubbish, but there is a pesky little thing called ethics that unfortunately I like to hold myself accountable to, so instead, let’s take a look at why you can’t just throw all of your data into your API and walk away!

This is a topic that I go on and on about - just ask my security audit clients! I’m usually talking about it within the context of Single-Page Applications (SPAs) - where developers love to send absolutely everything to the browser - but the context is exactly the same with external APIs and everything in between. It’s even valid when you’re building HTML server-side and data is going into tag attributes or inline JavaScript. Ultimately it’s all some form of an API, and all of that data is making its way to the consumer - be it a web browser, an API client, or an HTTP class within another application.

One of the great things about working in cybersecurity is the wealth of perfect examples we are provided with, in the form of some company getting something spectacularly wrong. As such, we have the perfect example of why you can’t just include everything in your API… Spoutable.

Did You Hear about Spoutable?

If you haven’t heard of Spoutable before, it’s one of the many social media services that was set up after Twitter changed hands. Its purpose, like most social media, was simple: let users sign up for accounts and interact with each other via messages and profile pages.

When you’re building a social media profile page, there are a few key pieces of information you need to build the page. Something like this lot:

  • Username
  • Display Name
  • About Me / Biography
  • Link(s)
  • Follower Count
  • Following Count
  • Post Count
  • Joined Date
  • Profile Photo

As you’d expect, Spoutable returned that sort of data, but they didn’t stop there… 

They also included:

  • Email Address
  • IP Address
  • Phone Number
  • Gender

What’s a little PII (Personally Identifiable Information) among friends, right? 

You could justify returning this data when you’re accessing your own profile page - but it was included on the public API. Any Spoutable user could send an API request to obtain this information about any other user.

But they didn’t stop here. 

To quote Troy Hunt, the creator of Have I Been Pwned:

Ever hear one of those stories where as it unravels, you lean in ever closer and mutter “No way! No way! NO WAY!” This one, as far as infosec stories go, had me leaning and muttering like never before.

The API also included (I’ll only be giving you the highlights, go check out Troy’s excellent writeup where he goes into all of the gory details!):

  • bcrypt hashed password
  • 2FA secret key
  • 2FA backup code hash
  • Password reset email token
  • and lots more…

Unlike the previous lot of PII, which is a very clear privacy violation and could be used to target the users specifically, externally to Spoutable, this set of information has a direct impact on Spoutable.

If the user’s password is weak enough, that bcrypt hash will be cracked in minutes, and having 2FA enabled won’t make a difference when all the hacker needs to do is put the 2FA secret key into their authenticator app - or crack the 2FA backup code. It’s a 6-digit number, so cracking it will take a very small number of minutes. Or the hacker can just use the password reset email token and reset the password instantly…

All of this was possible because the API returned too much data.

So we’ve established that returning too much data via an API is bad, but how did it happen, and what should we do to avoid it?

How’d This Happen?

The leaky Spoutable API appears to have been an internal one - for use with their own front end, rather than an externally published API for 3rd party clients. This is where a lot of developers come unstuck when it comes to data leaks. They spend their time thinking about the privacy risks of an external API, but will happily send everything through to their SPA in the browser - not even considering that their SPA also consumes an API!

In this case, I’m going to make the educated guess that the developer loaded the user model from the database using their framework or database ORM (Object Relational Mapper), which would have loaded the entire record for use. This is fine when working in the backend code, as the user (and browser) never sees this data, but when you proceed to send it to the front end (i.e. the SPA), it suddenly becomes public knowledge. 

The developer would’ve loaded the record, sent it off to the front end, and then written the JavaScript they needed to build the page, accessing only the data they needed from the user record and not checking or realising just how much data was included in the response.

It sounds far too simple a mistake to make, but I’ve seen the same situation (although none with that much sensitive data) many times across the security audits that I’ve worked on for my clients. These types of vulnerabilities almost always occur through simple code when the developer overlooks something that appears to be obvious in hindsight.

So we’ve looked at what data was returned, and how it likely happened, but how do we prevent it?

Avoiding Leaking Data

My recommendation is to be paranoid and explicit, and only return the data you actually need - rather than absolutely everything.

Let me give you a basic example using Laravel:

function show(string $username)
{
    // Look up the single matching user and return the entire record as-is.
    return User::where('username', $username)->first();
}

(Note, I’ve ignored error handling to make the example simpler.)

In the above code, we’re taking the provided username, finding a matching record, and returning the whole thing. It makes for incredibly clean code, but it’s also returning all of the data on that record - and as the developer, we have no visibility of what that data even is. All of it may have been safe to share when we wrote this action, but if a sensitive field is added later, that’ll be included in the response automatically too!

Instead, my proposal is to be explicit and define exactly what data you need sent to the SPA. This gives you immediate visibility of what data is going where, so you can be sure you’re only sending information that should be sent - and any added fields are automatically ignored.

Take a look:

function show(string $username)
{
    $user = User::where('username', $username)->first();

    // List every field the profile page needs; anything else stays on the server.
    return [
        'username' => $user->username,
        'display_name' => $user->display_name,
        'biography' => $user->biography,
        'links' => $user->links,
        'follower_count' => $user->follower_count,
        'following_count' => $user->following_count,
        'post_count' => $user->post_count,
        'joined_date' => $user->joined_date,
        'profile_photo' => $user->profile_photo,
    ];
}

The difference should be immediately obvious - you can see exactly what data is being returned. The chances of sensitive data being returned are significantly lower. You’d have to choose to include sensitive data in the response, and (hopefully) you’d be aware of the implications and access levels at that point.
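If hand-writing the array in every action feels repetitive, the same allowlist idea can be sketched in plain PHP. This is only a minimal illustration and not a Laravel feature: the `publicProfile()` helper, the `PUBLIC_PROFILE_FIELDS` constant, and the sample record are hypothetical names and made-up values for this example.

```php
<?php

// Hypothetical allowlist: the only fields the profile page needs right now.
const PUBLIC_PROFILE_FIELDS = ['username', 'display_name', 'biography'];

/**
 * Keep only the allowlisted keys from a full user record.
 * Anything not named in the allowlist is dropped, including fields added later.
 */
function publicProfile(array $record): array
{
    return array_intersect_key($record, array_flip(PUBLIC_PROFILE_FIELDS));
}

// Stand-in for a full database row (all values are made up).
$record = [
    'username'          => 'stephen',
    'display_name'      => 'Stephen',
    'biography'         => 'Security nerd.',
    'email'             => 'stephen@example.com',
    'password'          => 'not-a-real-hash',
    'two_factor_secret' => 'NOT-A-REAL-SECRET',
];

$profile = publicProfile($record);

echo json_encode($profile, JSON_PRETTY_PRINT), PHP_EOL;
```

The design choice is the same as in the Laravel example: the safe fields are named, so a new column on the record never reaches the response unless someone deliberately adds it to the list.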

Another thing to consider is when you’re working with internal APIs like this, you often have the freedom to make changes fairly quickly. This is another vote in favour of only returning the minimum amount of data you need. 

For example, if your first version of a profile page just shows the username, display name, and biography, then only return those three values. When you add in profile picture support, add that to the response too. Need follower count? Add it in when you need it.

This data may be public data you give away freely, but if the front end doesn’t need it, you don’t need to send it. 

As I said, I recommend being paranoid.

Alright, so we’ve looked at an issue with an internal API, but what about a public API?

This Happens with External APIs Too

At this point in my research, I went looking for good news articles about external APIs that leaked a bunch of data. I found a bunch of news reports, but none of them had decent examples of what actually happened with the leaky API. 

So let’s look at a hypothetical example instead, one which showcases a possible way for data to be leaked via an API.

Let’s start with an API endpoint that returns posts:

function getPostsWithComments()
{
    $posts = Post::with('comments')
        ->select('slug', 'title', 'author', 'content', 'published_at')
        ->get();

    return response()->json($posts);
}

The developer has been careful to select the database columns for the post, to ensure only the information that should be returned is returned, and is adding the comments for that post into the response too. Looks like pretty standard stuff, and they felt happy they’d covered any privacy concerns.

They then went on to document this API using their manual API documentation approach, and ended up with this list of response attributes:

  • slug - unique post identifier
  • title - full post title
  • author - display name of the post author
  • content - post content, in full HTML
  • published_at - timestamp when post was published
  • comments - array of comments 

At this point the developer committed, pushed, and was done with the job. 

They never tested it with realistic data and failed to notice the one massive hole…

[
  {
    "slug": "first-post",
    "title": "First Post",
    "author": "Stephen Rees-Carter",
    "content": "This is the content of the first post.",
    "published_at": "2024-03-07 12:30:00",
    "comments": [
      {
        "id": 1,
        "post_id": 1,
        "user_id": 1,
        "content": "Great post!",
        "email": "commenter@example.com",
        "ip_address": "192.168.1.1",
        "created_at": "2024-03-07 12:35:00"
      }
    ],
    // ...
  }
]

Yep! The exact same issue as we had in the internal API. The comments model returned more data than expected, and the developer didn’t check or limit it. They also made assumptions when documenting the endpoint, so it’s not documented either.
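One way to close that hole is to shape the nested comments just as explicitly as the post itself. The plain-PHP sketch below assumes the post has already been loaded into an array. `presentPost()` and `presentComment()` are hypothetical helper names, and the choice of which comment fields count as public is an assumption made for illustration.

```php
<?php

/** Return only the comment fields assumed to be safe to publish. */
function presentComment(array $comment): array
{
    return [
        'content'    => $comment['content'],
        'created_at' => $comment['created_at'],
    ];
}

/** Return the documented post fields, filtering every nested comment too. */
function presentPost(array $post): array
{
    return [
        'slug'         => $post['slug'],
        'title'        => $post['title'],
        'author'       => $post['author'],
        'content'      => $post['content'],
        'published_at' => $post['published_at'],
        'comments'     => array_map('presentComment', $post['comments']),
    ];
}

// Stand-in for a post loaded with its full comment records (made-up values).
$post = [
    'slug'         => 'first-post',
    'title'        => 'First Post',
    'author'       => 'Stephen Rees-Carter',
    'content'      => 'This is the content of the first post.',
    'published_at' => '2024-03-07 12:30:00',
    'comments'     => [
        [
            'id'         => 1,
            'post_id'    => 1,
            'user_id'    => 1,
            'content'    => 'Great post!',
            'email'      => 'commenter@example.com',
            'ip_address' => '192.168.1.1',
            'created_at' => '2024-03-07 12:35:00',
        ],
    ],
];

$safe = presentPost($post);

echo json_encode($safe, JSON_PRETTY_PRINT), PHP_EOL;
```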

My understanding is that data leaking via an API usually occurs when a developer doesn’t realise they are sending too much data in the response. It’s an easy trap to fall into - we developers make assumptions all the time about what our code is going to do.

It’s also especially easy to overlook and miss these issues with APIs as we’re often not looking at the actual responses - but rather just consuming them programmatically and using the fields we know should be there. It comes down to a lack of visibility in what data APIs are actually providing.

Gaining Visibility to APIs

The way to solve this issue of leaky APIs is to gain visibility of what data they’re returning, both during development with decent test data, and in production with decent monitoring.

Setting up decent test data in local development can be tricky if you have to do it manually, but many web frameworks include the concept of Seeders to make the job easier. Seeders are tools you can use to define the types of data that are supposed to be in your database, building realistic-looking fake data and all of the relationships between the records. With a decent set of seeders, your API testing will have a lot of useful data to work with.
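As a rough idea of what a seeder produces, here is a framework-free sketch in plain PHP. It is not Laravel’s Seeder API: `seedPosts()` and the field names are hypothetical, and the values are deliberately predictable rather than random. The point is that every column gets a value, including the sensitive ones, so a leak actually shows up when you inspect a response.

```php
<?php

/**
 * Build fake posts with nested comments, filling in every column,
 * including the sensitive ones, so a leak is visible in test responses.
 */
function seedPosts(int $postCount, int $commentsPerPost): array
{
    $posts = [];

    for ($p = 1; $p <= $postCount; $p++) {
        $comments = [];

        for ($c = 1; $c <= $commentsPerPost; $c++) {
            $comments[] = [
                'id'         => ($p - 1) * $commentsPerPost + $c,
                'post_id'    => $p, // relationship back to the parent post
                'content'    => "Comment {$c} on post {$p}",
                'email'      => "commenter{$c}@example.com",
                'ip_address' => "203.0.113.{$c}", // reserved documentation range
            ];
        }

        $posts[] = [
            'id'       => $p,
            'slug'     => "post-{$p}",
            'title'    => "Post {$p}",
            'comments' => $comments,
        ];
    }

    return $posts;
}

$posts = seedPosts(2, 3);

echo json_encode($posts, JSON_PRETTY_PRINT), PHP_EOL;
```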

Once you have useful test data in your local dev environment, or on a staging site, you need to run your own API calls and check what data is returned. Not just through the apps or SDKs you’re building to integrate with the API, but with an actual API client, so the focus is on the returned data.

If you’re on macOS, check out Aspen for a great API client, and if you’re on Windows like me, Nightingale is a good option (but there is an Aspen Windows version coming soon, psst).

Use these tools during development and testing to see what your API is returning. Not only will it flag any leaky data, but it’ll give you a better understanding of what data you’re returning and the format you’re returning it in.

You also need decent documentation for your APIs - and doing that by hand is a guaranteed way to miss things and leak data. Instead, if you use a tool like Treblle’s API Documentation, it’ll automatically identify all of the data you’re returning from your API. So when you’re monitoring your API and building the documentation, you’ll gain visibility of those extra fields that shouldn’t be there.

This is an incredibly valuable tool to add into an existing API and will gain you significant insights into what data you’re already exposing, and the areas you perhaps need to make some changes. It’ll also help you identify changes over time as new values are added to API payloads.

That said, you really need to catch this stuff in the development process - before sensitive data is exposed on production.
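One way to catch it during development is a small check that lists every field in a real response and compares that list against what you’ve documented. The sketch below is plain PHP and only an illustration: `collectKeyPaths()` is a hypothetical helper, the sample response is made up, and the documented list is the one from the hypothetical posts endpoint earlier. Integer keys are treated as list positions and collapsed to `[]`.

```php
<?php

/**
 * Recursively collect every key path in a decoded JSON value.
 * List items are collapsed to "[]" so each field is reported once.
 */
function collectKeyPaths($data, string $prefix = ''): array
{
    if (!is_array($data)) {
        return [];
    }

    $paths = [];
    foreach ($data as $key => $value) {
        if (is_int($key)) {
            // A list position: extend the prefix but don't report it as a field.
            $path = $prefix . '[]';
        } else {
            $path = $prefix === '' ? $key : $prefix . '.' . $key;
            $paths[$path] = true;
        }
        $paths += array_fill_keys(collectKeyPaths($value, $path), true);
    }

    return array_keys($paths);
}

// Made-up response body, shaped like the hypothetical posts endpoint.
$json = <<<'JSON'
[{"slug": "first-post", "title": "First Post", "author": "Stephen Rees-Carter",
  "content": "This is the content of the first post.",
  "published_at": "2024-03-07 12:30:00",
  "comments": [{"id": 1, "content": "Great post!",
                "email": "commenter@example.com", "ip_address": "192.168.1.1"}]}]
JSON;

$posts = json_decode($json, true);

// The attributes the developer wrote down by hand.
$documented = ['slug', 'title', 'author', 'content', 'published_at', 'comments'];

// Every field in the first post that the documentation never mentions.
$undocumented = array_values(array_diff(collectKeyPaths($posts[0]), $documented));

echo implode(PHP_EOL, $undocumented), PHP_EOL;
```

Run against realistic test data, a check like this surfaces the comment fields nobody documented, which is exactly where the email and IP address were hiding.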

Summary

To wrap all of this up, just remember what I said at the start: just return everything via your API and walk away, job done. Except, of course, that was dangerous rubbish: APIs can leak a whole lot of sensitive data, so you can’t simply return everything and walk away. Instead, you need to be proactive about controlling the data you include in your API responses and keep an eye on exactly what it contains. This is where the right tooling comes in.

Also, don’t forget that APIs come in all shapes and sizes - it’s not just externally published APIs you need to protect!
