Tuesday, February 8, 2011

Distributed key-value storage systems: important

Voldemort
http://project-voldemort.com/

Architectures at various companies

All architectures
http://highscalability.com/blog/category/example


Flickr Architecture
http://highscalability.com/flickr-architecture

Amazon Architecture
http://highscalability.com/amazon-architecture

LinkedIn
http://furiouspurpose.blogspot.com/2007/11/qcon-linkedin-architecture.html
http://highscalability.com/blog/2008/6/4/linkedin-architecture.html

Google Architecture
http://highscalability.com/google-architecture

YouTube Architecture
http://highscalability.com/youtube-architecture

PlentyOfFish Architecture
http://highscalability.com/plentyoffish-architecture

MySpace Architecture
http://highscalability.com/myspace-architecture
=====================================================================================
QCon: LinkedIn Architecture
Notes from the Linked-In: Lessons learned and growth and scalability session at QCon with Jean-Luc Vaillant.

Their architecture includes:

Java (trying out some Ruby, adding some C++, as little as possible)
Oracle 10g and MySQL
Spring
ActiveMQ (tried OracleMQ, doesn't recommend it)
Tomcat & Jetty
Lucene
Graph computations don't perform very well in a relational database: with large numbers of members and large numbers of connections, the combinatorics can be staggering. Add to this that simple approaches to storing this information would require extensive joining. The best way to get performance was to run the algorithms on the graph in RAM.

That raises the question of how to keep the RAM database in sync at all times. One option is to update the database and inform other engines of changes through direct RPC, reliable multicast, or JMS. This has the typical problems of two-phase commit.

An alternate approach that LinkedIn has used is to log changes in a transaction log, which each graph engine can pull into RAM as necessary. The approach is currently Oracle-specific, but it is applicable to just about any database.

Once that's in place, the in-memory techniques for traversing the graph are far less painful. Breadth-first traversal to get connections of various degrees. Using symmetry to find connections from both sides.
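As a rough illustration of that in-memory traversal (a Python sketch with an invented adjacency map, not LinkedIn's code), finding everyone within N degrees is a plain breadth-first search:

from collections import deque

def connections_within(graph, start, max_degree):
    """Return {member: degree} for everyone reachable from start within max_degree hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        member = queue.popleft()
        if seen[member] == max_degree:
            continue                      # don't expand past the requested degree
        for neighbor in graph.get(member, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[member] + 1
                queue.append(neighbor)
    return seen

# Tiny made-up graph kept entirely in RAM as an adjacency map.
graph = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
print(connections_within(graph, "alice", 2))   # {'alice': 0, 'bob': 1, 'carol': 2}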

Having run into issues with Read-Write Lock, he prefers Copy On Write.
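The copy-on-write idea can be sketched like this (hypothetical Python, not his implementation): readers always work against a stable snapshot, and a writer builds a modified copy and swaps the reference in one atomic step.

import threading

class CopyOnWriteGraph:
    """Readers never lock; writers copy, modify, then swap the reference."""
    def __init__(self):
        self.snapshot = {}                      # current immutable-by-convention version
        self._write_lock = threading.Lock()     # serializes writers only

    def neighbors(self, member):
        return self.snapshot.get(member, ())    # lock-free read of a stable snapshot

    def add_connection(self, a, b):
        with self._write_lock:
            new = dict(self.snapshot)           # copy
            new[a] = new.get(a, ()) + (b,)      # modify the copy, not the original
            self.snapshot = new                 # atomic reference swap; readers unaffected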

=============================================================
LinkedIn is the largest professional networking site in the world. LinkedIn employees presented two sessions about their server architecture at JavaOne 2008. This post contains a summary of these presentations.

Key topics include:

Up-to-date statistics about the LinkedIn user base and activity level

The evolution of the LinkedIn architecture, from 2003 to 2008

"The Cloud", the specialized server that maintains the LinkedIn network graph

Their communication architecture

==========================================================
Google Architecture
SATURDAY, NOVEMBER 22, 2008 AT 10:01AM
Update 2: Sorting 1 PB with MapReduce. PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks.
Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build a higher performing higher scaling infrastructure to support their products. How do they do that?

Information Sources
Video: Building Large Systems at Google
Google Lab: The Google File System
Google Lab: MapReduce: Simplified Data Processing on Large Clusters
Google Lab: BigTable.
Video: BigTable: A Distributed Structured Storage System.
Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems.
How Google Works by David Carr in Baseline Magazine.
Google Lab: Interpreting the Data: Parallel Analysis with Sawzall.
Dare Obasanjo's Notes on the scalability conference.
Platform
Linux
A large diversity of languages: Python, Java, C++
What's Inside?
The Stats
Estimated 450,000 low-cost commodity servers in 2006
In 2005 Google indexed 8 billion web pages. By now, who knows?
Currently there are over 200 GFS clusters at Google. A cluster can have 1000 or even 5000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.
BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.
The Stack
Google visualizes their infrastructure as a three layer stack:

Products: search, advertising, email, maps, video, chat, blogger
Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
Computing Platforms: a bunch of machines in a bunch of different data centers
Make sure it's easy for folks in the company to deploy at a low cost.
Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.
Reliable Storage Mechanism With GFS (Google File System)
Reliable scalable storage is a core need of any application. GFS is their core storage platform.
Google File System - large distributed log structured file system in which they throw in a lot of data.
Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required:
- high reliability across data centers
- scalability to thousands of network nodes
- huge read/write bandwidth requirements
- support for large blocks of data which are gigabytes in size.
- efficient distribution of operations across nodes to reduce bottlenecks
System has master and chunk servers.
- Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that holds the data they need on disk.
- Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.
A new application coming on line can use an existing GFS cluster or they can make their own. It would be interesting to understand the provisioning process they use across their data centers.
Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.
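To make the master/chunk-server split concrete, here is a hedged Python toy of a GFS-style read path. The 64MB chunk size and the master/chunk-server roles come from the description above; the class and method names and the in-memory tables are invented for illustration.

CHUNK_SIZE = 64 * 1024 * 1024   # 64MB chunks, as described above

class Master:
    """Holds only metadata: which chunk servers hold which chunk of which file."""
    def __init__(self, chunk_locations):
        # {(file_name, chunk_index): [replica, replica, replica]}
        self.chunk_locations = chunk_locations

    def locate(self, file_name, offset):
        index = offset // CHUNK_SIZE
        return index, self.chunk_locations[(file_name, index)]

class ChunkServer:
    """Stores the actual chunk bytes (a dict stands in for local disk here)."""
    def __init__(self, chunks):
        self.chunks = chunks    # {(file_name, chunk_index): bytes}

    def read(self, file_name, index, start, length):
        return self.chunks[(file_name, index)][start:start + length]

def client_read(master, file_name, offset, length):
    # 1. Ask the master only for metadata; 2. fetch the data from a replica directly.
    index, replicas = master.locate(file_name, offset)
    return replicas[0].read(file_name, index, offset % CHUNK_SIZE, length)

# Hypothetical setup: one small file in chunk 0, "replicated" to the same toy server.
server = ChunkServer({("logs/web.log", 0): b"GET /index.html 200 ..."})
master = Master({("logs/web.log", 0): [server, server, server]})
print(client_read(master, "logs/web.log", offset=0, length=3))   # b'GET'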
Do Something With The Data Using MapReduce
Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across 1000 machines. Databases don't scale, or don't scale cost effectively, to those levels. That's where MapReduce comes in.
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Why use MapReduce?
- Nice way to partition tasks across lots of machines.
- Handle machine failure.
- Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc.
- Computation can automatically move closer to the IO source.
The MapReduce system has three different types of servers.
- The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
- The Map servers accept user input and perform map operations on it. The results are written to intermediate files.
- The Reduce servers accept intermediate files produced by map servers and perform reduce operations on them.
For example, you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
- The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
- In MapReduce a map maps one view of data to another, producing a key value pair, which in our example is word and count.
- Shuffling aggregates key types.
- The reduction sums up all the key value pairs and produces the final answer.
The Google indexing pipeline has about 20 different map reductions. A pipeline looks at data with a whole bunch of records and aggregates keys. A second map-reduce comes along, takes that result and does something else. And so on.
Programs can be very small. As little as 20 to 50 lines of code.
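A word count really is only a few lines in this style. The sketch below is a single-process Python imitation of the map -> shuffle -> reduce steps described above, not Google's actual API:

from collections import defaultdict

def map_phase(page):
    # Map: emit a (word, 1) pair for every word on the page.
    for word in page.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

pages = ["the cat sat", "the dog sat"]
pairs = (pair for page in pages for pair in map_phase(page))
print(reduce_phase(shuffle(pairs)))   # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}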
One problem is stragglers. A straggler is a computation that is going slower than the others, which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple copies of the same computation and when one is done kill all the rest.
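That "run backup copies and take whichever finishes first" trick can be imitated with Python's concurrent.futures. A rough sketch of the idea, not Google's scheduler, with the task callable left hypothetical:

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_with_backups(task, n_copies=2):
    """Launch n_copies of the same computation and use whichever finishes first."""
    pool = ThreadPoolExecutor(max_workers=n_copies)
    futures = [pool.submit(task) for _ in range(n_copies)]
    done, not_done = wait(futures, return_when=FIRST_COMPLETED)
    for straggler in not_done:
        straggler.cancel()          # copies that haven't started yet are dropped
    pool.shutdown(wait=False)       # don't block on copies that are already running
    return next(iter(done)).result()

# e.g. run_with_backups(lambda: process_chunk(chunk))   # process_chunk is hypothetical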
Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
Storing Structured Data In BigTable
BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
Commercial databases simply don't scale to this level and they don't work across 1000s of machines.
By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.
Machines can be added and deleted while the system is running and the whole system just works.
Each data item is stored in a cell which can be accessed using a row key, column key, and timestamp.
Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
BigTable has three different types of servers:
- The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed.
- The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers.
- The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion.
A locality group can be used to physically store related bits of data together for better locality of reference.
Tablets are cached in RAM as much as possible.
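A toy model of the (row key, column key, timestamp) cell addressing described above, in Python. Real BigTable is a distributed system layered on GFS; this only illustrates the data model, and the class is invented for the example.

import time

class ToyTable:
    """Cells keyed by (row, column); each cell keeps timestamped versions."""
    def __init__(self):
        self.cells = {}     # {(row, column): [(timestamp, value), ...] newest first}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(reverse=True)             # keep the newest version first

    def get(self, row, column, at=None):
        versions = self.cells.get((row, column), [])
        if not versions:
            return None
        if at is None:
            return versions[0][1]               # newest version
        for ts, value in versions:              # newest first
            if ts <= at:
                return value
        return None

table = ToyTable()
table.put("com.example/index.html", "contents:", "<html>...</html>")
print(table.get("com.example/index.html", "contents:"))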
Hardware
When you have a lot of machines how do you build them to be cost efficient and use power efficiently?
Use ultra cheap commodity hardware and build software on top to handle their failures.
A 1,000-fold computer power increase can be had for a 33 times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.
Linux, in-house rack design, PC class mother boards, low end storage.
Price per watt on a performance basis isn't getting better. Have huge power and cooling issues.
Use a mix of colocation and their own data centers.
Misc
Push changes out quickly rather than wait for QA.
Libraries are the predominant way of building programs.
Some applications are provided as services, like crawling.
An infrastructure handles versioning of applications so they can be released without fear of breaking things.
Future Directions For Google
Support geo-distributed clusters.
Create a single global namespace for all data. Currently data is segregated by cluster.
More and better automated migration of data and computation.
Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).
Lessons Learned
Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at a scale few others can compete with. Many companies take a completely different approach. Many companies treat infrastructure as an expense. Each group will use completely different technologies and there will be little planning and commonality of how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.

Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.

Take a look at Hadoop if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.

An underappreciated advantage of a platform approach is that junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to reinvent the same distributed infrastructure wheel, you'll run into difficulty because the people who know how to do this are relatively rare.

Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.

Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.

Create a Darwinian infrastructure. Perform time consuming operations in parallel and take the winner.

Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.

Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.

=============================================
YouTube Architecture
WEDNESDAY, MARCH 12, 2008 AT 3:54PM
Update 2: YouTube Reaches One Billion Views Per Day. That’s at least 11,574 views per second, 694,444 views per minute, and 41,666,667 views per hour.
Update: YouTube: The Platform. YouTube adds a new rich set of APIs in order to become your video platform leader--all for free. Upload, edit, watch, search, and comment on video from your own site without visiting YouTube. Compose your site internally from APIs because you'll need to expose them later anyway.

YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?

Information Sources
Google Video
Platform
Apache
Python
Linux (SuSe)
MySQL
psyco, a dynamic python->C compiler
lighttpd for video instead of Apache
What's Inside?
The Stats
Supports the delivery of over 100 million videos per day.
Founded 2/2005
3/2006 30 million video views/day
7/2006 100 million video views/day
2 sysadmins, 2 scalability software architects
2 feature developers, 2 network engineers, 1 DBA
Recipe For Handling Rapid Growth
while (true)
{
    identify_and_fix_bottlenecks();
    drink();
    sleep();
    notice_new_bottleneck();
}

This loop runs many times a day.

Web Servers
NetScaler is used for load balancing and caching static content.
Run Apache with mod_fast_cgi.
Requests are routed for handling by a Python application server.
Application server talks to various databases and other information sources to get all the data and formats the HTML page.
Can usually scale web tier by adding more machines.
The Python web code is usually NOT the bottleneck, it spends most of its time blocked on RPCs.
Python allows rapid flexible development and deployment. This is critical given the competition they face.
Usually less than 100 ms page service times.
Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
For high CPU intensive activities like encryption, they use C extensions.
Some pre-generated cached HTML for expensive to render blocks.
Row level caching in the database.
Fully formed Python objects are cached.
Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.
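A rough sketch of that push-precalculated-data-into-local-memory idea in Python; the agent, the data it precomputes, and the refresh interval are all made up for illustration (the real setup pushes to every application server over the network):

import threading
import time

local_cache = {}    # lives inside each application server process

def precalculate():
    # Stand-in for whatever expensive computation the agent performs.
    return {"top_videos": ["a", "b", "c"], "generated_at": time.time()}

def refresh_agent(interval_seconds=60):
    """Background agent: recompute periodically and overwrite the local cache."""
    while True:
        local_cache.update(precalculate())
        time.sleep(interval_seconds)

threading.Thread(target=refresh_agent, daemon=True).start()
# Request handlers then read local_cache directly -- no network hop at all.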
Video Serving
Costs include bandwidth, hardware, and power consumption.
Each video is hosted by a mini-cluster. Each video is served by more than one machine.
Using a cluster means:
- More disks serving content which means more speed.
- Headroom. If a machine goes down others can take over.
- There are online backups.
Servers use the lighttpd web server for video:
- Apache had too much overhead.
- Uses epoll to wait on multiple fds.
- Switched from single process to multiple process configuration to handle more connections.
Most popular content is moved to a CDN (content delivery network):
- CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network.
- CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.
Less popular content (1-20 views per day) uses YouTube servers in various colo sites.
- There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disk blocks are being accessed.
- Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product caching won't always be your performance savior.
- Tune RAID controller and pay attention to other lower level issues to help.
- Tune memory on each machine so there's not too much and not too little.
Serving Video Key Points
Keep it simple and cheap.
Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
Use commodity hardware. The more expensive the hardware gets, the more expensive everything else gets too (support contracts). You are also less likely to find help on the net.
Use simple common tools. They use mostly tools built into Linux and layer on top of those.
Handle random seeks well (SATA, tweaks).
Serving Thumbnails
Surprisingly difficult to do efficiently.
There are about 4 thumbnails for each video so there are a lot more thumbnails than videos.
Thumbnails are hosted on just a few machines.
Saw problems associated with serving a lot of small objects:
- Lots of disk seeks and problems with inode caches and page caches at OS level.
- Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea.
- A high number of requests/sec, as web pages can display 60 thumbnails per page.
- Under such high loads Apache performed badly.
- Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20.
- Tried using lighttpd, but with a single thread it stalled. Ran into problems with multiprocess mode because each process would keep a separate cache.
- With so many images, setting up a new machine took over 24 hours.
- Rebooting a machine took 6-10 hours for the cache to warm up enough to not go to disk.
To solve all their problems they started using Google's BigTable, a distributed data store:
- Avoids small file problem because it clumps files together.
- Fast, fault tolerant. Assumes it's working on an unreliable network.
- Lower latency because it uses a distributed multilevel cache. This cache works across different colocation sites.
- For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.
Databases
The Early Years
- Use MySQL to store meta data like users, tags, and descriptions.
- Served data off a monolithic RAID 10 Volume with 10 disks.
- Living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered.
- They went through a common evolution: single server, went to a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach.
- Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded and usually run on lesser machines and replication is asynchronous, so the slaves can lag significantly behind the master.
- Updates cause cache misses which go to disk, where slow I/O causes slow replication.
- Using a replicating architecture you need to spend a lot of money for incremental bits of write performance.
- One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video, so that function should get the most resources. The social networking features of YouTube are less important so they can be routed to a less capable cluster.
The later years:
- Went to database partitioning.
- Split into shards with users assigned to different shards.
- Spreads writes and reads.
- Much better cache locality which means less IO.
- Resulted in a 30% hardware reduction.
- Reduced replica lag to 0.
- Can now scale database almost arbitrarily.
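Routing by shard, as just described, can be as simple as the sketch below; the hash-on-user-id scheme and the connection strings are assumptions for illustration, not YouTube's actual layout:

SHARDS = [
    "mysql://shard0.example.internal/youtube",
    "mysql://shard1.example.internal/youtube",
    "mysql://shard2.example.internal/youtube",
    "mysql://shard3.example.internal/youtube",
]

def shard_for_user(user_id):
    """All reads and writes for one user land on the one shard that owns that user."""
    return SHARDS[user_id % len(SHARDS)]

# Keeping a user's data on a single shard is what gives the better cache
# locality and removes replica lag for that user's reads.
print(shard_for_user(123456))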
Data Center Strategy
Used managed hosting providers at first. Living off credit cards so it was the only way.
Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
Use 5 or 6 data centers plus the CDN.
Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
Video is bandwidth dependent, not really latency dependent. It can come from any colo.
For images latency matters, especially when you have 60 images on a page.
Images are replicated to different data centers using BigTable. Code looks at different metrics to know who is closest.
Lessons Learned
Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.
Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.
Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.
Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.
Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more write performance.
Constant iteration on bottlenecks:
- Software: DB, caching
- OS: disk I/O
- Hardware: memory, RAID
You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

===========================================================
PlentyOfFish Architecture
FRIDAY, JUNE 26, 2009 AT 3:58PM
Update 4: Jeff Atwood costs out Markus' scale up approach against a scale out approach and finds scale up wanting. The discussion in the comments is as interesting as the article. My guess is Markus doesn't want to rewrite his software to work across a scale out cluster so even if it's more expensive scale up works better for his needs.
Update 3: POF now has 200 million images and serves 10,000 images per second. They'll be moving to a 250,000 IOPS RamSan to handle the load. Also upgraded to a core database machine with 512 GB of RAM, 32 CPUs, SQL Server 2008 and Windows 2008.
Update 2: This seems to be a POF Peer1 love fest infomercial. It's pretty content free, but the production values are high. Lots of quirky sounds and fish swimming on the screen.
Update: by Facebook standards Read/WriteWeb says POF is worth a cool one billion dollars. It helps to talk like Dr. Evil when saying it out loud.

PlentyOfFish is a hugely popular on-line dating system slammed by over 45 million visitors a month and 30+ million hits a day (500 - 600 pages per second). But that's not the most interesting part of the story. All this is handled by one person, using a handful of servers, working a few hours a day, while making $6 million a year from Google ads. Jealous? I know I am. How are all these love connections made using so few resources?

Site: http://www.plentyoffish.com/

Information Sources
Channel9 Interview with Markus Frind
Blog of Markus Frind
Plentyoffish: 1-Man Company May Be Worth $1Billion
The Platform
Microsoft Windows
ASP.NET
IIS
Akamai CDN
Foundry ServerIron Load Balancer
The Stats
PlentyOfFish (POF) gets 1.2 billion page views/month, and 500,000 average unique logins per day. The peak season is January, when it will grow 30 percent.
POF has one single employee: the founder and CEO Markus Frind.
Makes up to $10 million a year on Google ads working only two hours a day.
30+ Million Hits a Day (500 - 600 pages per second).
1.1 billion page views and 45 million visitors a month.
Has 5-10 times the click through rate of Facebook.
A top 30 site in the US based on Compete's Attention metric, top 10 in Canada and top 30 in the UK.
2 load balanced web servers, each with 2 Quad Core Intel Xeon X5355 CPUs @ 2.66GHz, 8 GB of RAM (using about 800 MB), 2 hard drives, running Windows x64 Server 2003.
3 DB servers. No data on their configuration.
Approaching 64,000 simultaneous connections and 2 million page views per hour.
Internet connection is a 1Gbps line of which 200Mbps is used.
1 TB/day serving 171 million images through Akamai.
6TB storage array to handle millions of full sized images being uploaded every month to the site.
What's Inside
Revenue model has been to use Google ads. Match.com, in comparison, generates $300 million a year, primarily from subscriptions. POF's revenue model is about to change so it can capture more revenue from all those users. The plan is to hire more employees, hire sales people, and sell ads directly instead of relying solely on AdSense.
With 30 million page views a day you can make good money on advertising, even at 5 - 10 cents CPM.
Akamai is used to serve 100 million plus image requests a day. If you have 8 images and each takes 100 msecs, you are talking about a second of load time just for the images. So distributing the images makes sense.
Tens of millions of image requests are served directly from their servers, but the majority of these images are less than 2KB and are mostly cached in RAM.
Everything is dynamic. Nothing is static.
All outbound data is gzipped at a cost of only 30% CPU usage. This implies a lot of processing power on those servers, but it really cuts bandwidth usage.
No caching functionality in ASP.NET is used. It is not used because as soon as the data is put in the cache it's already expired.
No built in components from ASP are used. Everything is written from scratch. Nothing is more complex than simple if-thens and for loops. Keep it simple.
Load balancing
- IIS arbitrarily limits the total connections to 64,000 so a load balancer was added to handle the large number of simultaneous connections. Adding a second IP address and then using a round robin DNS was considered, but the load balancer was considered more redundant and allowed easier swap in of more web servers. And using ServerIron allowed advanced functionality like bot blocking and load balancing based on passed on cookies, session data, and IP data.
- The Windows Network Load Balancing (NLB) feature was not used because it doesn't do sticky sessions. A way around this would be to store session state in a database or in a shared file system.
- 8-12 NLB servers can be put in a farm and there can be an unlimited number of farms. A DNS round-robin scheme can be used between farms. Such an architecture has been used to enable 70 front end web servers to support over 300,000 concurrent users.
- NLB has an affinity option so a user always maps to a certain server, thus no external storage is used for session state and if the server fails the user loses their state and must relogin. If this state includes a shopping cart or other important data, this solution may be poor, but for a dating site it seems reasonable.
- It was thought that the cost of storing and fetching session data in software was too expensive. Hardware load balancing is simpler. Just map users to specific servers and if a server fails have the user log in again.
- The cost of a ServerIron was cheaper and simpler than using NLB. Many major sites use them for TCP connection pooling, automated bot detection, etc. ServerIron can do a lot more than load balancing and these features are attractive for the cost.
Has a big problem picking an ad server. Ad server firms want several hundred thousand a year plus they want multi-year contracts.
In the process of getting rid of ASP.NET repeaters and instead using string appends or response.write. If you are doing over a million page views a day, just write out the code to spit it out to the screen.
Most of the build out costs went towards a SAN. Redundancy at any cost.
Growth was through word of mouth. Went nuts in Canada, spread to UK, Australia, and then to the US.
Database
- One database is the main database.
- Two databases are for search. Load balanced between search servers based on the type of search performed.
- Monitors performance using task manager. When spikes show up he investigates. Problems were usually blocking in the database. It's always database issues. Rarely any problems in .net. Because POF doesn't use the .net library it's relatively easy to track down performance problems. When you are using many layers of frameworks finding out where problems are hiding is frustrating and hard.
- If you call the database 20 times per page view you are screwed no matter what you do.
- Separate database reads from writes. If you don't have a lot of RAM and you do reads and writes you get paging involved which can hang your system for seconds.
- Try and make a read only database if you can.
- Denormalize data. If you have to fetch stuff from 20 different tables try and make one table that is just used for reading.
- One day it will work, but when your database doubles in size it won't work anymore.
- If you only do one thing in a system it will do it really really well. Just do writes and that's good. Just do reads and that's good. Mix them up and it messes things up. You run into locking and blocking issues.
- If you are maxing the CPU you've either done something wrong or it's really really optimized. If you can fit the database in RAM do it.
The development process is: come up with an idea. Throw it up within 24 hours. It kind of half works. See what user response is by looking at what they actually do on the site. Do messages per user increase? Do session times increase? If people don't like it then take it down.
System failures are rare and short lived. Biggest issues are DNS issues where some ISP says POF doesn't exist anymore. But because the site is free, people accept a little down time. People often don't notice sites down because they think it's their problem.
Going from one million to 12 million users was a big jump. He could scale to 60 million users with two web servers.
Will often look at competitors for ideas for new features.
Will consider something like S3 when it becomes geographically load balanced.
Lessons Learned
You don't need millions in funding, a sprawling infrastructure, and a building full of employees to create a world class website that handles a torrent of users while making good money. All you need is an idea that appeals to a lot of people, a site that takes off by word of mouth, and the experience and vision to build a site without falling into the typical traps of the trade. That's all you need :-)
Necessity is the mother of all change.
When you grow quickly, but not too quickly, you have a chance to grow, modify, and adapt.
RAM solves all problems. After that it's just growing using bigger machines.
When starting out keep everything as simple as possible. Nearly everyone gives this same advice and Markus makes a noticeable point of saying everything he does is just obvious common sense. But clearly what is simple isn't merely common sense. Creating simple things is the result of years of practical experience.
Keep database access fast and you have no issues.
A big reason POF can get away with so few people and so little equipment is they use a CDN for serving large heavily used content. Using a CDN may be the secret sauce in a lot of large websites. Markus thinks there isn't a single site in the top 100 that doesn’t use a CDN. Without a CDN he thinks load time in Australia would go to 3 or 4 seconds because of all the images.
Advertising on Facebook yielded poor results. With 2000 clicks only 1 signed up. With a CTR of 0.04%, Facebook gets 0.4 clicks per 1000 ad impressions. At a 5 cent CPM that's 12.5 cents a click; at 50 cents CPM, $1.25 a click; at $1.00 CPM, $2.50 a click; at $15.00 CPM, $37.50 a click.
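The arithmetic behind those figures: at 0.4 clicks per 1000 impressions, cost per click is just the CPM divided by 0.4. A quick check in Python, using the numbers above:

CLICKS_PER_1000_IMPRESSIONS = 0.4   # a 0.04% click-through rate

def cost_per_click(cpm_dollars):
    # You pay cpm_dollars for 1000 impressions, and 1000 impressions yield 0.4 clicks.
    return cpm_dollars / CLICKS_PER_1000_IMPRESSIONS

for cpm in (0.05, 0.50, 1.00, 15.00):
    print(f"${cpm:.2f} CPM  ->  about ${cost_per_click(cpm):.3f} per click")
# $0.05 CPM -> about $0.125 per click ... $15.00 CPM -> about $37.500 per click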
It's easy to sell a few million page views at high CPMs. It's a LOT harder to sell billions of page views at high CPMs, as shown by MySpace and Facebook.
The ad-supported model limits your revenues. You have to go to a paid model to grow larger. To generate 100 million a year as a free site is virtually impossible as you need too big a market.
Growing page views via Facebook for a dating site won't work. Having a visitor on your own site is much more profitable. Most of Facebook's page views are outside the US and you have to split the 5 cent CPMs with Facebook.
Co-req is a potentially large source of income. This is where you offer, in your site's sign up, to send the user more information about mortgages or some other product.
You can't always listen to user responses. Some users will always love new features and others will hate them. Only a fraction will complain. Instead, look at what features people are actually using by watching your site.

=================================================================

MySpace Architecture
THURSDAY, FEBRUARY 12, 2009 AT 1:28AM
Update: Presentation: Behind the Scenes at MySpace.com. Dan Farino, Chief Systems Architect at MySpace, shares details of some of MySpace's cool internal operations tools.

MySpace.com is one of the fastest growing sites on the Internet, with 65 million subscribers and 260,000 new users registering each day. Often criticized for poor performance, MySpace has had to tackle scalability issues few other sites have faced. How did they do it?

Site: http://myspace.com

Information Sources
Presentation: Behind the Scenes at MySpace.com
Inside MySpace.com
Platform
ASP.NET 2.0
Windows
IIS
SQL Server
What's Inside?
300 million users.
Pushes 100 gigabits/second to the internet. 10Gb/sec is HTML content.
4,500+ web servers running Windows 2003/IIS 6.0/ASP.NET.
1,200+ cache servers running 64-bit Windows 2003. 16GB of objects cached in RAM.
500+ database servers running 64-bit Windows and SQL Server 2005.

MySpace processes 1.5 billion page views per day and handles 2.3 million concurrent users during the day.
Membership Milestones:
- 500,000 Users: A Simple Architecture Stumbles
- 1 Million Users: Vertical Partitioning Solves Scalability Woes
- 3 Million Users: Scale-Out Wins Over Scale-Up
- 9 Million Users: Site Migrates to ASP.NET, Adds Virtual Storage
- 26 Million Users: MySpace Embraces 64-Bit Technology
500,000 accounts was too much load for two web servers and a single database.
At 1-2 Million Accounts
- They used a database architecture built around the concept of vertical partitioning, with separate databases for parts of the website that served different functions such as the log-in screen, user profiles and blogs.
- The vertical partitioning scheme helped divide up the workload for database reads and writes alike, and when users demanded a new feature, MySpace would put a new database online to support it.
- MySpace switched from using storage devices directly attached to its database servers to a storage area network (SAN), in which a pool of disk storage devices are tied together by a high-speed, specialized network, and the databases connect to the SAN. The change to a SAN boosted performance, uptime and reliability.
At 3 Million Accounts
- the vertical partitioning solution didn't last because they replicated some horizontal information like user accounts across all vertical slices. With so many replications one would fail and slow down the system.
- individual applications like blogs on sub-sections of the Web site would grow too large for a single database server
- Reorganized all the core data to be logically organized into one database
- split its user base into chunks of 1 million accounts and put all the data keyed to those accounts in a separate instance of SQL Server
9 Million–17 Million Accounts
- Moved to ASP.NET which used less resources than their previous architecture. 150 servers running the new code were able to do the same work that had previously required 246.
- Saw storage bottlenecks again. Implementing a SAN had solved some early performance problems, but now the Web site's demands were starting to periodically overwhelm the SAN's I/O capacity—the speed with which it could read and write data to and from disk storage.
- Hit limits with the 1 million-accounts-per-database division approach as these limits were exceeded.
- Moved to a virtualized storage architecture where the entire SAN is treated as one big pool of storage capacity, without requiring that specific disks be dedicated to serving specific applications. MySpace now standardized on equipment from a relatively new SAN vendor, 3PARdata
Added a caching tier—a layer of servers placed between the Web servers and the database servers whose sole job was to capture copies of frequently accessed data objects in memory and serve them to the Web application without the need for a database lookup.
26 Million Accounts
- Moved to 64-bit SQL server to work around their memory bottleneck issues. Their standard database server configuration uses 64 GB of RAM.
Horizontally Federated Database. Databases are partitioned by purpose: there are profile databases, email databases, etc. Partitioning is based on user range; 1 million users live in each database. So you have Profile1, Profile2, all the way up to Profile300, since they have 300 million users.
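Range-based partitioning like that (1 million users per database, Profile1 through Profile300) maps a user to a database with one line of arithmetic. A Python sketch, assuming sequential integer user IDs:

USERS_PER_DATABASE = 1_000_000

def profile_database_for(user_id):
    """User IDs 1..1,000,000 live in Profile1, the next million in Profile2, and so on."""
    return f"Profile{(user_id - 1) // USERS_PER_DATABASE + 1}"

print(profile_database_for(1))             # Profile1
print(profile_database_for(1_000_000))     # Profile1
print(profile_database_for(250_000_123))   # Profile251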
Doesn't use ASP cache because they don't have a high enough hit rate on the front-end. The middle tier cache does have a high hit rate.
Failure isolation. Segment requests into web servers by database. Allow only 7 threads per database. So if the database is slow only those threads will slow down and the traffic in the other threads will flow.
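That "only 7 threads per database" rule is essentially a per-backend concurrency cap. A hedged Python sketch with one semaphore per database; the limit of 7 comes from the text above, while the database names and the query call are hypothetical:

import threading

MAX_THREADS_PER_DATABASE = 7
_db_slots = {
    name: threading.Semaphore(MAX_THREADS_PER_DATABASE)
    for name in ("Profile1", "Profile2", "Email1")      # hypothetical database names
}

def query(database_name, run_query):
    """At most 7 requests may be inside any one database at a time; a slow
    database stalls only its own 7 threads, traffic to the others keeps flowing."""
    with _db_slots[database_name]:
        return run_query()

# e.g. query("Profile1", lambda: connection.execute(sql))   # connection/sql are hypothetical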
Operations
PerfCollector. Centralized collection of performance data via UDP. More reliable than Windows and allows any client to connect and see stats.
Web Based Stack Dump Tool. Can right-click on a problem server and get a stack dump of the .NET managed threads. Used to have to RDC into the system, attach a debugger, and get an answer a half hour later. Slow, nonscalable, and tedious. Not just a stack dump, it gives a lot of context about what the thread is doing. Troubleshooting is easier because you can see that 90 threads are blocked on a database so the database may be down.
Web Based Heap Dump Tool. Dumps all memory allocations. Very useful for developers. Saves hours of doing it by hand.
Profiler. Traces a request from start to finish and produces a report. See URL, methods, status, everything that will help you identify a slow request. Looks at lock contention, whether a lot of exceptions are being thrown, anything that might be interesting. Very lightweight. It's running on one box in every VIP (group of 100 servers) in production. Samples 1 thread every 10 seconds. Always tracing in background.
PowerShell. Microsoft's new shell that runs in process and passes objects between commands versus parsing text output. MySpace develops a lot of commandlets to support operations.
Developed their own asynchronous communication technology to get around Windows networking problems and treat servers as a group. Can ship a .cs file, compile it, run it, and ship the response back.
Codespew. Pushes code updates on their communication technology. Used to do 5 code pushes a day, now down to 1 a week.
Lessons Learned
You can build big websites using Microsoft tech.
A cache should have been used from the beginning.
The cache is a better place to store transitory data that doesn't need to be recorded in a database, such as temporary files created to track a particular user's session on the Web site.
Built in OS features to detect denial of service attacks can cause inexplicable failures.
Distribute your data to geographically diverse data centers to handle power failures.
Consider using virtualized storage/clustered file systems from the start. It allows you to massively parallelize IO access while being able to add disk as needed without any reorganization needed.
Develop tools that work in a production environment. Can't simulate everything in test environment. The scale and variety of uses APIs are put to can't be simulated in QA during testing. Legitimate users and hackers will run into corner cases that weren't hit in testing, though QA will find most of the problems.
Throw hardware at problems. Easier than changing their backend software to a new way of doing things. The example is they add a new database server for every million users. It might be more efficient to change their approach to more efficiently use the database hardware, but it's easier just to add servers. For now.

Piccolo: another distributed approach

Reportedly both development time and execution speed are faster than Hadoop.

http://piccolo.news.cs.nyu.edu/

Distributed File System

Hadoop
Kosmos
MogileFS

Web Server

lighttpd
nginx

Proxy Server

Squid
Varnish: reportedly very fast?