Month: October 2015

Master server status

From: "Timothy Pearson" <kb9vqf@...>
Date: Thu, 8 Oct 2015 11:27:46 -0500


As Lisi alluded to earlier, the TDE master server is continuing to
experience issues causing sporadic service outages.  I believe I have
traced this fault to a defective CPU package; this is the first time in
well over a decade that I have actually seen a defective CPU, but the
certainty of the diagnosis has grown sufficiently that I have ordered a

I have disabled some secondary services to try to reduce overall system
load in the hopes that this will stabilize the remaining services until
the replacement parts arrive.  You can follow along on the status of the
repairs at this page (when available):

Technical details:
The TDE project makes heavy use of a rather beefy server containing G34
Opteron processors (i.e. in the $1,000 USD range _per CPU package_).  I
started to see various MCEs (not related to DRAM) and, much more commonly,
lockups several weeks ago but assumed it was a power stability issue. 
Unfortunately, even after swapping the PSUs the lockups are obviously
continuing, and becoming more frequent.

My best guess is that this particular processor has developed a somewhat
unstable L2 / L3 cache; the MCEs that I did log were similar to this:
[Hardware Error]: MC2 Error: VB Data ECC or parity error.
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: CPU:6 (15:2:0) MC2_STATUS[Over|CE|MiscV|-|-|-|-|CECC]:
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
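
For anyone who wants to check their own logs for the same pattern, here is
a rough Python sketch that tallies MCE records of the format shown above by
CPU, MC bank, and cache level.  It is an illustration only, not the tooling
used on the TDE server: the function name and the regular expressions are
assumptions based solely on the four sample lines quoted in this message,
and other kernels or processors may format these records differently.

```python
import re
from collections import Counter

# Sample lines copied from the message above.
SAMPLE = """\
[Hardware Error]: MC2 Error: VB Data ECC or parity error.
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: CPU:6 (15:2:0) MC2_STATUS[Over|CE|MiscV|-|-|-|-|CECC]:
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
"""

# Hypothetical patterns, written only against the sample format above.
CPU_RE = re.compile(r"CPU:(\d+) \(([0-9a-fA-F:]+)\) MC(\d+)_STATUS")
CACHE_RE = re.compile(r"cache level: (\w+)")


def tally_mces(text):
    """Count (cpu, bank, cache_level) triples found in the log text."""
    counts = Counter()
    current = None  # (cpu, bank) of the record currently being read
    for line in text.splitlines():
        match = CPU_RE.search(line)
        if match:
            current = (int(match.group(1)), int(match.group(3)))
            continue
        match = CACHE_RE.search(line)
        if match and current is not None:
            counts[(current[0], current[1], match.group(1))] += 1
            current = None
    return counts


if __name__ == "__main__":
    for (cpu, bank, level), n in sorted(tally_mces(SAMPLE).items()):
        print(f"CPU {cpu}, MC{bank}, {level}: {n} event(s)")
```

If the counts cluster on the cores of a single package, that would be
consistent with a fault in that package rather than in the PSUs or DRAM.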

While it would be possible to completely shut down QuickBuild (thus taking
much of the load off of the affected servers), I believe this would be
detrimental to the long term existence of the project.  In particular, we
would lose all tinderboxing; I would effectively need to pick a specific
version of Debian, make sure that a specific release of TDE works on that,
and ignore all the rest.  Also, the release interval would jump back up
into the multi-year range due to the difficulty involved in manually
assembling all of the requisite repository files.  There are other
somewhat obvious drawbacks as well, but just those two alone would
probably kill the project.

I also don't reasonably see how QuickBuild's services can be replaced by
anything free in the cloud.  The built TDE packages occupy several hundred
gigabytes of disk space, and can easily hog dozens of build machines on
multiple architectures.  A long time ago (back when TDE only supported a
couple of Ubuntu versions) I used the free Launchpad build service, but
this project rapidly outgrew that service.  It rather impedes development
to have your rebuild take months to work through the public build queue...

So, to summarize: QuickBuild requires powerful servers, lots of power, and
loads of disk space.  It is also essential to TDE's relatively
rapid release schedule, serving as a QA check and release management
platform.  Unfortunately that means the annual cost to run the TDE
services is very high, and I don't always have the funding to eliminate
all sources of downtime.

Thank you to those few that have donated over the years; it has helped in
a small way to keep TDE alive.  As Lisi said, if everyone donated a
relatively small amount annually, there would be no real financial concerns and
both I and Slavek would have more time to actually work on the reason we
are all here -- TDE itself!

Thank you,

Timothy Pearson
Trinity Desktop Project