Further, these data centers also run
big data processing tools such as Hadoop, Spark,
Dryad, and databases,
which all work to process information
and make it available to these web applications.
These data processing frameworks
can move massive amounts of data around.
So what does the actual measured traffic
in data centers look like
when they're running these applications?
The short answer, unfortunately,
even though it seems like a cop-out, is: it depends.
It depends on the applications you're running,
on the scale you're running them at,
and on the design of the network
as well as the design of the applications.
But nevertheless, let's take a look
at some of the published data.
One thing that's unambiguously true
is that traffic volume inside data centers is growing rapidly,
and that it makes up the majority of the traffic these servers see.
The majority of the traffic is not
to and from the Internet, but rather
inside the data center.
Here what you're looking at is data from Google
showing the traffic generated by the servers
in their data centers over a period of time,
a bit more than six years, which is on the X-axis.
On the Y-axis is the aggregate traffic volume.
The absolute numbers are not available,
but over the six-year period
the traffic volume grows by a factor of roughly 50.
Google has noted that this traffic is doubling every year.
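As a quick sanity check on those two figures, here is a minimal Python sketch; the six-year span and the yearly doubling are the only inputs, both taken from the numbers above. Doubling every year for six years gives about a 64x increase, in the same ballpark as the reported ~50x growth.

```python
# Sanity check: does "doubling every year" roughly match
# "~50x growth over a bit more than six years"?

years = 6                        # span shown on the X-axis
growth_if_doubling = 2 ** years  # 2x per year, compounded

print(growth_if_doubling)        # 64 -- same ballpark as the ~50x reported
```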
Google's paper is not quite clear about
whether this is data-center-internal traffic only,
but Facebook has mentioned that machine-to-machine traffic
is several orders of magnitude larger
than what goes out to the Internet.
So really, most of the traffic in these facilities is going
to be within the facility, as opposed to traffic
to and from the Internet,
and it's growing quite rapidly.
So what does this traffic look like?
One question we might want to have answered
is about locality.
Do machines communicate mostly with neighboring machines,
or is traffic spread uniformly
throughout the data center?
So here we have some data from Facebook
that goes some way towards addressing this question.
Let's focus on this part of the table,
where all of Facebook's data center traffic
is partitioned by locality:
within a rack, within a cluster, within a data center,
or across data centers.
As you can see, roughly 13% of the traffic is within a rack.
So these are just machines within the same rack
talking to each other.
A rack might host some tens of machines;
40 machines, for example, seems to be quite common.
Then we see that 58% of the traffic
stays within a cluster but not within a rack.
So this traffic is across racks within a cluster.
Further, 12% of the traffic stays within the data center,
but is not cluster-local.
So this is traffic
between multiple clusters in the data center.
Also interesting is that 18% of the traffic
is between data centers.
This is actually larger than the rack-local traffic.
So locality in this workload is not really high.
Also worth noting is that Hadoop
is the single largest driver of traffic
in Facebook data centers.
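To keep that breakdown straight, here is a minimal Python sketch using the approximate percentages quoted above; the category labels are my own shorthand, not Facebook's exact terminology.

```python
# Approximate locality breakdown of Facebook's data center traffic,
# using the percentages quoted above (they sum to roughly 100%).
traffic_share = {
    "within rack":         0.13,  # machines in the same rack
    "within cluster":      0.58,  # across racks, same cluster
    "within data center":  0.12,  # across clusters, same data center
    "across data centers": 0.18,  # inter-data-center traffic
}

# Fraction of traffic that must leave the rack, i.e. cross the
# network beyond the top-of-rack switch:
leaves_rack = 1 - traffic_share["within rack"]
print(f"{leaves_rack:.0%} of traffic leaves the rack")  # ~87%
```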
More data on rack locality comes from Google.
What you're looking at here
is data from 12 blocks of servers.
Blocks are groups of racks,
so a block might have a few hundred servers.
This is a finer granularity than a cluster,
but a coarser granularity than a rack.
Here, what you're looking at in the figure on the right
is the traffic from a block that is leaving for other blocks,
that is, the non-local traffic.
For each of the 12 blocks in the figure,
you see that most of the traffic is non-local,
that is, most of the traffic goes to other blocks.
Now, there are 12 blocks here.
If traffic were uniformly distributed,
you would see 1/12 of the traffic being local,
and 11/12, that is roughly 92%, being non-local,
which is exactly what this graph shows.
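To make that uniform-spread argument concrete, here is a quick Python calculation, assuming a block spreads its traffic evenly over all 12 blocks, including itself.

```python
# If a block's traffic were spread evenly over all 12 blocks,
# only 1 of the 12 destinations would be the block itself.
blocks = 12
local_fraction = 1 / blocks                  # ~8% stays local
non_local_fraction = (blocks - 1) / blocks   # ~92% goes to other blocks

print(f"local: {local_fraction:.1%}, non-local: {non_local_fraction:.1%}")
```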
Part of this definitely stems
from how Google organizes storage.
The paper notes that, for greater availability,
they spread data across different fault domains.
So, for example, a block might get all its power
from one source.
If that power source fails,
you lose everything in that block.
So if data is spread well across multiple blocks,
the service might still be available.
This kind of organization is good for availability,
but bad for locality.
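This is not Google's actual placement logic, but a minimal sketch of the idea: each piece of data is replicated across distinct fault domains, modeled here as server blocks that each share a power source. The three-way replication factor and all names are assumptions for illustration.

```python
import random

# Hypothetical sketch (not Google's actual algorithm): place replicas
# of each data chunk in distinct fault domains, here modeled as
# server blocks that each share a single power source.

BLOCKS = [f"block-{i}" for i in range(12)]   # 12 blocks, as in the figure
REPLICAS = 3                                 # assumed replication factor

def place_replicas(chunk_id: str) -> list[str]:
    """Pick REPLICAS distinct blocks so no two replicas share a fault domain.
    Placement is randomized here purely for simplicity."""
    return random.sample(BLOCKS, REPLICAS)

print(place_replicas("chunk-42"))
# Losing any one block still leaves two replicas elsewhere, but a
# reader will often have to fetch the chunk from another block:
# good for availability, bad for locality.
```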
Another set of measurements comes from Benson et al.,
who evaluated three university clusters,
two private enterprise networks,
and five commercial cloud networks.
The paper obscures who the cloud provider is,
but each of these data centers hosts 10,000 or more servers
running applications including web search,
and one of the authors works at Microsoft.
The university and private data centers
have a few hundred to 2,000 servers each,
while the cloud data centers have 10,000 to 15,000 servers.
Cloud data centers one through three
run many applications, including web, mail, etcetera,
but clouds four and five run more MapReduce-style workloads.
One thing worth noticing is that
the amount of rack locality here is much larger:
70% or so for the cloud data centers,
which is very different from what we saw earlier
in the Google and Facebook measurements.
There are many possible reasons for these differences.
For one, the workloads might be different.
Not even all MapReduce jobs are the same.
It's entirely possible that Google and Facebook
run large MapReduce tasks which do not fit in a rack,
while this mystery data center runs smaller tasks
that do fit in a rack, and hence
its traffic is mostly rack-local.
There might also just be different ways
of organizing storage and compute.
There's also a five-year gap
between the publishing of these measurements.
Perhaps things are just different now:
application sizes might have grown substantially,
or people might have changed how they do these things.
Having looked at locality, let's turn our attention
to flow-level characteristics.
How many flows does a server see concurrently?
From Facebook's measurements, and I quote here: