Databases (Transcript)

This is the transcript for the Databases session.

This session included a slideshow presentation with an outline, which will be available online alongside the transcript.

General outline:

  1. Slow things get killed
  2. Other choice: Split it
  • Split #1: slavery
  • Split #2: Babel
  • Split #3: guilds
  • Split #4: delay
Q: What kinds of things end up in job queues?
A: *couldn't keep up*
  3. LoadBalancer.php
Has multiple setups; can organize by language and can weight per database.
  4. Backups
Q: We don't expose this to users, right?
A: We'd have to explore some way of screening, "expose this bit, don't
expose this bit," *couldn't keep up*

  5. Object cache
  6. Compression
  7. Latest problems
  • App servers lock up
  • Hardware issues


More problems:

No data on site usage.
It would be nice to have such data, so that we could give developers and
engineers information about how to design solutions to community problems,
and about how to organize the community to design solutions to engineering problems.

Profiling system:

Problem: profiling produces lots of data, which is hard to analyze.
Analyzing profiling output can take a long time (~5 hrs).
Solutions:
Profiling is done for short periods, as Apache can produce tremendous amounts
of data (gigabytes) per minute of traces.
(Ivan:) You profile just one (or a handful) of the testbed machines in the cluster.
This can be used to keep good, up-to-date profiling of system performance.
It also avoids the problem of enabling profiling on many machines and then disabling it.
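
Ivan's suggestion of profiling only a few testbed machines could be sketched roughly as follows. This is a minimal illustration, not the cluster's actual profiling setup: the host names, `should_profile`, and `handle_request` are all hypothetical, and a simple wall-clock timer stands in for a real profiler.

```python
import time

# Hypothetical host names; the session did not name the testbed machines.
TESTBED_HOSTS = frozenset({"testbed1", "testbed2"})


def should_profile(hostname, testbed_hosts=TESTBED_HOSTS):
    """Profile only on the designated testbed machines."""
    return hostname in testbed_hosts


def handle_request(handler, hostname, samples):
    """Run handler(); on testbed hosts, also record how long it took.

    `samples` is a list that collects (hostname, seconds) tuples.
    Hosts outside the testbed run the handler with no extra work.
    """
    if not should_profile(hostname):
        return handler()
    start = time.perf_counter()
    try:
        return handler()
    finally:
        samples.append((hostname, time.perf_counter() - start))
```

Because the decision is made per host, profiling can stay on permanently for the testbed machines while the rest of the cluster never collects any data.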


Q: The check user (the user name lookup) is seriously slow; I think it actually has to iterate between IP...
A: To make this faster we can tell MySQL to generate an index on user IPs, so that the search is much faster.
Q: Can I expect this to happen in the near future?
A: Yes, especially if you email us about it.
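
The effect of an index on the IP column can be illustrated with a small sketch. Everything here is an assumption made for illustration: the table and column names (`user_edits`, `user_name`, `ip`) are hypothetical, and SQLite (via Python's standard `sqlite3` module) stands in for MySQL so the example runs without a server.

```python
import sqlite3


def build_demo(conn):
    """Create a hypothetical table mapping user names to IPs."""
    conn.execute("CREATE TABLE user_edits (user_name TEXT, ip TEXT)")
    rows = [("user%d" % i, "10.0.%d.%d" % (i // 256, i % 256))
            for i in range(1000)]
    conn.executemany("INSERT INTO user_edits VALUES (?, ?)", rows)


def lookup(conn, ip):
    """Return the user names seen on a given IP."""
    cur = conn.execute(
        "SELECT user_name FROM user_edits WHERE ip = ?", (ip,))
    return [row[0] for row in cur.fetchall()]


def query_plan(conn, ip):
    """Return SQLite's description of how it would run the lookup."""
    cur = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT user_name FROM user_edits WHERE ip = ?", (ip,))
    # The last column of each row is the human-readable detail.
    return " ".join(str(row[-1]) for row in cur.fetchall())


conn = sqlite3.connect(":memory:")
build_demo(conn)
plan_before = query_plan(conn, "10.0.1.5")   # full table scan

# MySQL accepts the same CREATE INDEX ... ON table (column) form.
conn.execute("CREATE INDEX idx_user_edits_ip ON user_edits (ip)")
plan_after = query_plan(conn, "10.0.1.5")    # lookup through the index

print(plan_before)
print(plan_after)
```

Without the index the database has to scan every row to find matching IPs; with it, the lookup goes through the index instead.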
Q: My main question is regarding resiliency. If I were to, say, restart database 4, would it automatically be detected, or...? When it comes back up, it still requires manual replication. What can we do...
A: We allow quite a lot of error to happen just to make the site work. Whatever.
Q: Is there anything we can add to make it more resilient?
A:


Q: It takes manual intervention to set up a replication server. Does it take
manual intervention to make sure that all servers aren't trying to reach a
database server which goes down?
A: Not a big deal, the load balancer notices when a server goes down and
doesn't push requests to it.
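
The behavior described in this answer, combined with the per-database weighting mentioned in the outline, could look roughly like the sketch below. It is not the code in LoadBalancer.php, whose internals were not covered in the session: the class, its method names, and the server names and weights are hypothetical.

```python
import random


class LoadBalancer:
    """Weighted random choice among database servers that are up."""

    def __init__(self, weights, rng=None):
        # weights: mapping of server name -> relative share of requests
        self.weights = dict(weights)
        self.down = set()
        self.rng = rng or random.Random()

    def mark_down(self, server):
        """Stop sending requests to a server that has failed."""
        self.down.add(server)

    def mark_up(self, server):
        """Resume sending requests to a server that has recovered."""
        self.down.discard(server)

    def pick(self):
        """Pick a live server, favoring those with higher weights."""
        candidates = [(name, weight)
                      for name, weight in self.weights.items()
                      if name not in self.down and weight > 0]
        if not candidates:
            raise RuntimeError("no database servers available")
        names, weights = zip(*candidates)
        return self.rng.choices(names, weights=weights, k=1)[0]


# Hypothetical servers and weights, for illustration only.
balancer = LoadBalancer({"db1": 1, "db2": 3, "db4": 2})
balancer.mark_down("db4")   # e.g. db4 is being restarted
print(balancer.pick())      # only db1 or db2 can be returned here
```

Note that this only covers routing around a failed server; as the questions above point out, bringing replication back up on a restarted server is a separate, manual step.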
Q: en wiki runs separately from the other wikis, and its storage is external?
A: Yes.
Q: Is?
A: Yes.
Q: I was curious about the toolserver. You say it takes hours to replicate over the internet?
A: It lags because it's loaded, because people write stupid queries.
Q: What bandwidth does it take to replicate?
A: Less than a megabit.
Q: Would it help to have a toolserver on this side of the pond?
A: I'm not sure it benefits to have a toolserver.
A: I'm not sure it matters where it's physically located.
Q: What is the relative load of search compared to everything else?
A: Search gets an order of magnitude (or a couple) less load than page views.
It may be increasing.