Databases (Transcript)
This is the transcript for the Databases session.
This discussion includes a slideshow presentation with an outline, which will be available with the transcript online.
General outline:
- Slow things get killed
- Other choice: Split it
- Split #1: slavery
- Split #2: Babel
- Split #3: guilds
- Split #4: delay
- Q: what kinds of things end up in job queues?
- A: *couldn't keep up*
- LoadBalancer.php
- has multiple setups, can organize by language, and can weight per database (see the sketch after this outline)
- Backups
- Q: we don't expose this to users, right?
- A: we'd have to explore some way of screening, "expose this bit, don't expose this bit," *couldn't keep up*
- Object cache
- Compression
- Latest problems
- App servers lock up
- hardware issues
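The LoadBalancer.php item above comes down to a weighted-random pick of a read replica, with wikis grouped into sections (for example, English Wikipedia split out on its own). Below is a minimal Python sketch of that idea; the section names, host names, and weights are made up for illustration, and this is not the actual LoadBalancer.php code.

```python
import random

# Hypothetical replica map: each section has read replicas with weights,
# so a stronger box can take a larger share of the read queries.
SECTIONS = {
    "default": {"db2": 100, "db3": 100, "db4": 50},
    "enwiki":  {"db10": 200, "db11": 200},  # English Wikipedia split out
}

def pick_replica(wiki: str) -> str:
    """Weighted-random choice of a read replica for the given wiki."""
    section = SECTIONS.get(wiki, SECTIONS["default"])
    hosts, weights = zip(*section.items())
    return random.choices(hosts, weights=weights, k=1)[0]

print(pick_replica("enwiki"))   # one of db10/db11
print(pick_replica("dewiki"))   # falls back to the default section
```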
More problems:
- No data on site usage.
- It would be nice to use such data to give developers and engineers information about how to design solutions to community problems, and how to organize the community to design solutions to engineering problems.
Profiling system:
- Problem: profiling produces lots of data, which is hard to analyze.
- Analyzing profiling output can take a long time (~5 hrs).
- Solutions:
- Profiling is done for short periods, as Apache can produce tremendous amounts of data (gigabytes of traces) per minute.
- (Ivan:) profile just one (or a handful) of testbed machines in the cluster.
- This can be used to keep good, up-to-date profiling of system performance.
- This also avoids the problem of enabling profiling on many machines and then disabling it. (A small sampling sketch follows this section.)
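One way to reconcile "tremendous amounts of data" with "good, up-to-date profiling" is to profile only a sampled fraction of requests (or only a designated testbed machine) rather than tracing everything. Here is a small Python sketch of request-level sampling using the standard-library profiler; the handler and the sample rate are hypothetical stand-ins.

```python
import cProfile
import io
import pstats
import random

SAMPLE_RATE = 0.01  # profile roughly 1 request in 100

def handle_request():
    """Stand-in for real request handling."""
    sum(i * i for i in range(10_000))

def maybe_profiled_request():
    # Only a small sample of requests pays the profiling cost, so the
    # trace volume stays far below gigabytes per minute.
    if random.random() < SAMPLE_RATE:
        profiler = cProfile.Profile()
        profiler.enable()
        handle_request()
        profiler.disable()
        out = io.StringIO()
        pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
        print(out.getvalue())
    else:
        handle_request()

for _ in range(200):
    maybe_profiled_request()
```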
- Q: The CheckUser user name lookup is seriously slow; I think it actually has to iterate over IP...
- A: To make this faster we can tell MySQL to generate an index on user IPs, so that the search is much faster (see the sketch after this exchange).
- Q: Can I expect this to happen in the near future?
- A: Yes, especially if you email us about it
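The index suggested above is ordinary DDL on the column being searched. The sketch below uses SQLite from Python purely to illustrate the effect; production would be MySQL, and the table and column names are hypothetical, not the real CheckUser schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical edit log; in production this lives in MySQL.
cur.execute("CREATE TABLE recent_edits (user_name TEXT, ip TEXT, ts INTEGER)")

# Without this index, a lookup by IP is a full table scan;
# with it, the same query becomes an index lookup.
cur.execute("CREATE INDEX idx_recent_edits_ip ON recent_edits (ip)")

plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT user_name FROM recent_edits WHERE ip = ?",
    ("192.0.2.1",),
).fetchall()
print(plan)  # the plan mentions idx_recent_edits_ip rather than a scan
```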
- Q: My main question is regarding resiliency: if I were to, say, restart database 4, would it automatically be detected, or...? When it comes back up, it still requires manual replication. What can we do...
- A: We allow quite a lot of errors to happen just to make the site work. Whatever.
- Q: Is there anything we can add to make it more resilient?
- A:
- Q: It takes manual intervention to set up a replication server. Does it take manual intervention to make sure that all servers aren't trying to reach a database server which goes down?
- A: Not a big deal: the load balancer notices when a server goes down and doesn't push requests to it (a small sketch of that idea follows).
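A rough sketch of "the load balancer notices when a server goes down": drop unreachable hosts before the weighted pick, so a restarted database simply stops receiving queries until it answers again. The host names and the liveness check here are assumptions for illustration, not the actual mechanism in LoadBalancer.php.

```python
import random
import socket

REPLICAS = {"db2": 100, "db3": 100, "db4": 50}  # host -> weight

def is_reachable(host: str, port: int = 3306, timeout: float = 0.5) -> bool:
    """Cheap liveness check: can we open a TCP connection to the MySQL port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_live_replica() -> str:
    # Filter out hosts that do not answer, then do the weighted pick
    # over whatever is left.
    live = {h: w for h, w in REPLICAS.items() if is_reachable(h)}
    if not live:
        raise RuntimeError("no database replicas reachable")
    hosts, weights = zip(*live.items())
    return random.choices(hosts, weights=weights, k=1)[0]
```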
- Q: enwiki is running separately from the other wikis; its storage is external?
- A: Yes.
- Q: Is?
- A: Yes.
- Q: I was curious about the toolserver: you say it takes hours to replicate over the internet?
- A: it lags because it's loaded, because people write stupid queries.
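Replication lag of this sort can be read straight off the replica with MySQL's SHOW SLAVE STATUS (the Seconds_Behind_Master column). A hedged Python sketch, assuming the third-party PyMySQL package and placeholder connection details:

```python
import pymysql

# Connection details are placeholders; point this at the replica to check.
conn = pymysql.connect(host="toolserver-db", user="monitor", password="secret",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

if status is None:
    print("this server is not a replica")
else:
    lag = status["Seconds_Behind_Master"]
    # A NULL lag value means replication is not running at all.
    print(f"replication lag: {lag} seconds" if lag is not None else "replication stopped")
```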
- Q: What bandwidth does it take to replicate?
- A: less than a megabit.
- Q: would it benefit to have a toolserver on this side of the pond?
- A: I'm not sure it benefits to have a toolserver.
- A: I'm not sure it matters where it's physically located.
- Q: what is the relative load of search compared to everything else?
- A: We get an order of magnitude (or a couple) less than page views. It may be increasing.