From what I can gather, Twitter had an image storage problem, handling 200 GB per second.
Twitter used to accept tweets with images attached and pass the full image data, along with the tweet, through the entire pipeline.
Once saved, images would live forever in the database.
To resolve this, Twitter made media uploads (images, videos) a separate step. When a tweet that uses an image is created, a handle for that image is sent along with the tweet instead of the entire image data.
They also gave images a 20-day TTL (time to live). After 20 days, images would be removed from the database.
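A minimal sketch of that decoupled flow, assuming an in-memory store; `upload_media`, `create_tweet`, and `expire_media` are illustrative names, not Twitter's actual API:

```python
import time
import uuid

MEDIA_TTL_SECONDS = 20 * 24 * 60 * 60  # the 20-day TTL described above

media_store = {}   # media_id -> {"data": bytes, "created_at": float}
tweet_store = []   # tweets only ever carry media handles, never raw bytes

def upload_media(data: bytes) -> str:
    """Upload the image/video bytes once and get back a small handle."""
    media_id = uuid.uuid4().hex
    media_store[media_id] = {"data": data, "created_at": time.time()}
    return media_id

def create_tweet(text: str, media_ids: list[str]) -> dict:
    """The tweet that flows through the write pipeline holds only handles."""
    tweet = {"text": text, "media_ids": media_ids}
    tweet_store.append(tweet)
    return tweet

def expire_media(now: float) -> None:
    """Drop media older than the TTL (a periodic cleanup job would call this)."""
    for media_id, entry in list(media_store.items()):
        if now - entry["created_at"] > MEDIA_TTL_SECONDS:
            del media_store[media_id]

# Usage: upload first, then reference the handle from the tweet.
handle = upload_media(b"...jpeg bytes...")
create_tweet("sunset over the bay", [handle])
```

The point is that the heavy bytes take their own path and can be cleaned up on their own schedule, while the tweet itself stays small.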
In order to improve usage in areas with slow internet, they created segmented, resumable uploads. When a user uploads an image, the upload is broken into segments and each segment is uploaded on its own. If any one segment fails, only that segment needs to be retried. If you walk into a subway and lose connectivity, the upload resumes when you come back out.
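Roughly how the client side of that could look. This is a sketch only: `send_segment` stands in for whatever per-segment request the real client makes, and the segment size is made up.

```python
SEGMENT_SIZE = 256 * 1024  # illustrative segment size, not Twitter's actual value

def split_into_segments(data: bytes, size: int = SEGMENT_SIZE) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

def upload_resumable(data: bytes, send_segment, uploaded=None) -> set:
    """Upload segment by segment; 'uploaded' records which indices already made it,
    so a dropped connection only costs the segments that were never sent."""
    uploaded = uploaded or set()
    for index, segment in enumerate(split_into_segments(data)):
        if index in uploaded:
            continue                     # already uploaded before we lost connectivity
        try:
            send_segment(index, segment)  # e.g. one HTTP request per segment
            uploaded.add(index)
        except ConnectionError:
            break                         # stop for now; resume later with 'uploaded'
    return uploaded

# Usage: call once, keep the returned set, and call again with it after
# connectivity returns; only the missing segments are re-sent.
```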
Facebook had to deal with massive, short-lived viewership spikes during live video streams.
- How to stream the live video content around the globe?
- Avoid the thundering herd problem?
- Thundering Herd: when a large number of processes waiting on some event are all woken up when it occurs, but only one of them can proceed at a time.
- Waking up processes takes resources; it would be cheaper to wake just one at a time.
- Millions of viewers all requesting the same video content is, in effect, a thundering herd.
- When a video stream is started, it is streamed to a Live Stream Server
- The live stream server transcodes the video stream into multiple bit rates
- For each bit rate, a set of 1 second segments is produced
- These 1 second segments are then sent to a data center cache
- The data center then sends the segments to Points of Presence (PoP) around the globe
- Users receive the video stream from the PoPs
- When a user requests video from a PoP, the user is actually talking to an HTTP proxy layer within the PoP
- The HTTP proxy checks if the requested video segment is in the PoP cache
- If not, then a request is sent to the data center to get that segment
- If it is, the segment is sent to the user
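A rough sketch of that per-request path at a PoP; the cache is just a dict here and `fetch_from_data_center` is a stand-in for the real back-haul request:

```python
pop_cache = {}  # (stream_id, bitrate, segment_no) -> segment bytes

def fetch_from_data_center(stream_id: str, bitrate: str, segment_no: int) -> bytes:
    # Stand-in for the request the PoP makes back to the data-center cache.
    return b"<segment bytes>"

def get_segment(stream_id: str, bitrate: str, segment_no: int) -> bytes:
    """Roughly what the HTTP proxy layer inside a PoP does for each request."""
    key = (stream_id, bitrate, segment_no)
    if key in pop_cache:
        return pop_cache[key]                   # hit: serve straight from the PoP
    segment = fetch_from_data_center(*key)      # miss: go back to the data center
    pop_cache[key] = segment                    # keep it for the next viewer
    return segment
```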
The Thundering Herd
If many users at the same PoP request the same segment at the same time, and that segment is not in the cache and must be retrieved from the data center, then you have a thundering herd.
To solve this, only the first request for a segment is actually sent; all the other users who want that segment wait on that in-flight request.
The HTTP proxies have a caching layer as well. Only the first request to the HTTP proxy makes a call to the PoP cache for the video segment; the others wait on that request.
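A minimal sketch of that request-coalescing idea (the same pattern as Go's singleflight), assuming a threaded proxy; `fetch` is whatever call would otherwise stampede the next layer:

```python
import threading
from concurrent.futures import Future

_inflight = {}              # key -> Future holding the one in-flight fetch
_lock = threading.Lock()

def get_segment_coalesced(key, fetch) -> bytes:
    """Only the first caller for a given key actually fetches; every other
    caller waits on that same in-flight request instead of piling on."""
    with _lock:
        fut = _inflight.get(key)
        leader = fut is None
        if leader:                      # we are the first request for this segment
            fut = Future()
            _inflight[key] = fut
    if leader:
        try:
            fut.set_result(fetch(key))  # the single fetch to the next layer
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            with _lock:
                _inflight.pop(key, None)
    return fut.result()                 # followers block here until the leader is done
```

Once the leader's fetch completes, every waiter gets the same result (or the same error), so the next layer sees one request instead of thousands.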
Over the course of a year and a half, Uber went from 200 to 2000 engineers, producing more than 7000 Git repos and more than 1000 microservices.
Here are some aspects of Uber's massive growth and the trade-offs that came with it.
- A microservices-focused approach allows teams to be formed quickly and hit the ground running. These teams can work (mostly) independently of each other, and in their own language. It also allows those teams to put their own service into production.
- But making changes could break something and cause an avalanche effect. As Matt Ranney states in the source video, Uber was most stable over the weekend, when engineers were at home, not changing things.
- If something goes wrong, how do you figure out where in the chain of services the break occurred?
- Another less obvious cost is "politics", as Matt Ranney put it. Instead of confronting another team over issues with their service, you could just write more code to get around those issues. I would personally find it easier to just confront the other team, but then again, I've never been the Chief Systems Architect at Uber, so I'll take Matt's word for it.
Git Repos... Everywhere
Uber has skyrocketed to 7000+ Git repos.
One repo can be good or bad; many repos can be good or bad. There are arguments both ways.
- A monolithic repo allows quick navigation of the code base.
- Easier to make changes that span the code base
- May be easier to understand the project as a whole
- A monolithic repo may get so big that building the project or checking it out becomes unreasonably difficult unless you have some kind of special tooling to ease this.
- (According to the source video, Google has a monolithic repo and a virtual file system to make checking out specific parts of it easier.)
- A monolithic repo may not work well with Uber's "many teams, many languages, many services" approach.
RPC is slow, but it works
How should communication between all those services be handled?
- HTTP comms introduce interpretation issues at such a large scale (too many moving parts)
- How should we interpret the many aspects (moving parts) of HTTP?
- What does this header mean?
- When should this status code be used? What about this one?
- What about the query string format?
- When should we use PUT vs. POST, etc.?
- Keeping these interpretations consistent throughout 7000+ projects and teams is not easy
- JSON is human readable and flexible, but it lacks types.
- One service may treat an empty string and null as the same thing, but a service down the line may not see them that way
- Type coercion in one language (Uber uses more than just JS for its many services) may not behave the same in another, e.g. '5 ' == 5 is true in JS but maybe not in PHP or Go.
- RPC is slower, but it makes sense in Uber's case
- All the services Uber uses are in house
- Communicating with servers/services that you own is not like communicating with a browser
- Thus you can use RPC (one does not use RPC to communicate with a browser)
- Thus, one service communicating with another is similar to making a function call
- RPC is like making a function call. Put two and two together (see the sketch after this list).
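As a rough illustration of "a service call that reads like a function call" with typed messages: this is a generic sketch, not Uber's actual RPC framework, and `RiderService`, the field names, and the `transport` object are all made up.

```python
from dataclasses import dataclass

# Typed messages: the schema settles questions like "" vs null and string-vs-int
# coercion once, instead of leaving them to every consumer's interpretation.
@dataclass
class GetRiderRequest:
    rider_id: int

@dataclass
class GetRiderResponse:
    rider_id: int
    display_name: str   # always a str; "absent" is modeled explicitly, not as ""

class RiderServiceClient:
    """Caller-side stub: invoking the remote service reads like a function call."""
    def __init__(self, transport):
        self._transport = transport   # hides serialization and networking details

    def get_rider(self, req: GetRiderRequest) -> GetRiderResponse:
        raw = self._transport.call("RiderService.get_rider", {"rider_id": req.rider_id})
        return GetRiderResponse(rider_id=raw["rider_id"], display_name=raw["display_name"])
```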
Lots of Languages
- Having lots of different languages can fragment the culture
- Separation of Concerns: If necessary, separate different aspects of your application to different servers. For example, a server (or cluster) for API requests, and one for serving static content.
- In long processing pipelines, avoid passing unnecessary data through the pipeline. (Twitter)
- Choose the right communication scheme for large-scale, in-house, service-to-service communication. (Uber)
- Cache data where necessary, and handle large request spikes for such data appropriately so as to avoid issues like the thundering herd. (Facebook)
- Be aware of the trade-offs of any decision
- Know your requirements and growth predictions