NitroWare.net

A technical recap of the Intel Atom clock signal issue

Under heavy use, the clock signal on one of the buses degrades faster than normal, and after somewhere between 18 months and 3 years the CPU will no longer boot, especially after a power outage or power cycle.

To reduce the risk of failure, avoid excessive hard reboots or power cycles on an aging device.

Intel documented this issue as erratum AVR50 and later AVR54, both of which are listed in the January 2017 specification update document for the Atom Processor C2000 Family. A workaround is available, but the issue was permanently fixed in a new revision of the chipset. Intel maintain these documents for all their products, continuously listing new errata and their fixes.

[Image: errata table from the Intel Atom C2000 specification update]

[Image: Intel Atom C2000 (Rangeley) erratum AVR54]

Do these failure scenarios sound like “Epic failures”? Yes, they do.

[Image: Intel Rangeley block diagram]

The ‘legacy block’ includes buses such as GPIO and LPC. General Purpose I/O (GPIO) lines are used in system-on-chip designs to interface with buttons, switches and displays, and can trigger events in the processor such as standby, reset or toggling wireless.

For example, the LEDs and wireless buttons on a typical router are wired up this way, and development boards for makers and hobbyists expose many of these lines.

Low Pin Count (LPC) is a 33 MHz bus used to connect the Intel platform chipset to auxiliary devices such as the flash ROM that contains the system BIOS/firmware, and controller chips that provide legacy connectivity such as serial and parallel ports, floppy disk, and PS/2 keyboard and mouse. USB is natively available in the system chipset and is completely separate.
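
To give a feel for how modest this bus is, here is a rough back-of-the-envelope sketch in Python. The 33 MHz clock comes from the text above; the 4-bit multiplexed data path is an assumption taken from Intel's LPC interface specification rather than from this article, and the function name is purely illustrative. Real-world throughput is lower, since every transfer also spends clocks on framing, addressing and turnaround.

```python
# Rough peak-rate estimate for the LPC bus (illustrative only).
LPC_CLOCK_HZ = 33_000_000  # 33 MHz, as stated in the article
LPC_DATA_LINES = 4         # assumption: 4 multiplexed address/data lines (LAD[3:0])


def lpc_peak_megabytes_per_second(clock_hz: int = LPC_CLOCK_HZ,
                                  data_lines: int = LPC_DATA_LINES) -> float:
    """Return the theoretical raw peak rate in MB/s, ignoring protocol overhead."""
    bits_per_second = clock_hz * data_lines  # one bit per line per clock
    return bits_per_second / 8 / 1_000_000   # bits -> bytes -> megabytes


if __name__ == "__main__":
    print(f"Theoretical LPC peak: {lpc_peak_megabytes_per_second():.1f} MB/s")
```

That is tiny next to USB or SATA, which is consistent with LPC being reserved for boot firmware and slow legacy peripherals.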

[Image: Seagate NAS Pro motherboard with Intel Atom C2000 (Rangeley)]

Netgate (not to be confused with Netgear) is a semi-niche networking vendor that produces devices such as firewalls running the popular BSD-based pfSense operating system. Some of their devices used the Atom C2000 chip, but they chose to implement it their own way using an open source firmware approach, and as such avoided the trap of the faulty LPC bus, which would otherwise connect a serial flash EEPROM holding the BIOS code.

https://www.netgate.com/blog/clock-signal-component-issue.html

 

To quote an administrator on the pfSense forums:

“Not using the LPC bus is…unusual for an Intel design.  We have zero need for it so we didn’t use it”

https://forum.pfsense.org/index.php?topic=125105.msg699015#msg699015

[Image: pfSense forum post on the LPC bus in Netgate's Intel Atom C2000 (Rangeley) devices]

Based on information gathered from Netgate, ADI (an ODM of embedded and industrial PC motherboards for vendors like Netgate) and Intel documents, the workaround for errata AVR50 and AVR54 seems to involve BIOS updates that modify how serial communication and interrupts behave on the LPC bus, specific to the particular board in the affected device. Additionally, a hardware rework on those affected devices reduces the electrical load/strength of the signals on the LPC bus and therefore reduces stress on the part.

Not all devices in the field can have their system BIOS updated. Even when a vendor offers a firmware or OS update for, say, a black box firewall, router or switch, that update is typically for the main software operating system. The operating system would need access to the system BIOS in order to flash, or update, it. This functionality is often deliberately reserved for factory use only, in order to prevent faults and preserve ‘security’; plus there’s the engineering work required by the vendor to ensure that the mechanism works. A manufacturer may choose not to implement it for a function that may never be used in the life of the product, as they foresee their BIOS implementation as stripped down and basic, which ‘does not need’ to ever be updated…

I need to be clear, though: because of the silicon lottery and the sometimes obscure nature of the errata that occur in chips, owning an affected chip does not mean it will fail; rather, it is very likely to fail. Vendors tend to claim otherwise, but the stacks of faulty product that appear some time later, often years after the fact and with the same issues, are better evidence of the truth.

Vendors will also tout industry-standard or below-industry-standard failure rates, but there is no objective way to verify those specific claims. Those within the industry with exposure to multiple examples of a device, such as administrators or repairers, would have a better grasp of the status quo regarding a faulty device.

If those numbers are so low, then what are the odds that, quite apart from the clock signal issue, one of the switches I administer has a fault with its flash memory card? Funnily enough, this issue can be related to the clock signal bug, as storage can be connected via the affected bus lines.
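
To put the “what are the odds” question into rough numbers, here is a minimal Python sketch. The per-device failure rate and fleet sizes are hypothetical placeholders, not vendor figures, and the function name is illustrative. It also assumes failures are independent, which is exactly what a shared design flaw such as the clock signal issue would violate.

```python
def prob_at_least_one_failure(per_device_rate: float, fleet_size: int) -> float:
    """Chance that at least one device in a fleet fails.

    Assumes each device fails independently with the same probability,
    which is a simplification: a common design flaw correlates failures.
    """
    if not 0.0 <= per_device_rate <= 1.0:
        raise ValueError("per_device_rate must be between 0 and 1")
    if fleet_size < 0:
        raise ValueError("fleet_size must not be negative")
    # P(at least one failure) = 1 - P(no device fails)
    return 1.0 - (1.0 - per_device_rate) ** fleet_size


if __name__ == "__main__":
    hypothetical_rate = 0.01  # placeholder: 1% per device, NOT a vendor figure
    for fleet_size in (1, 10, 100):
        chance = prob_at_least_one_failure(hypothetical_rate, fleet_size)
        print(f"{fleet_size:>3} devices: {chance:.1%} chance of at least one failure")
```

Even with a low quoted rate, an administrator looking after many units should expect to see the occasional fault; seeing the same fault repeatedly across a small fleet is what suggests the real rate is higher than advertised.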