SonOTA – Flashing Itead Sonoff devices via original OTA mechanism

Long story short

There’s now a script with which you can flash your Sonoff device via the original internal OTA upgrade mechanism, meaning there is no need to open the device, solder anything, etc. to get your custom firmware onto it.

This isn’t perfect (yet) — please mind the issues at the end of this post!

https://github.com/mirko/SonOTA

Credits

First things first: Credits!
The problem with credits is you usually forget somebody and that’s most likely happening here as well.
I read around quite a lot, gathered information and partially don’t even remember anymore where I read what (first).

Of course I’m impressed by the entire Tasmota project and what it enables one to do with the Itead Sonoff and similar devices.

Special thanks go to khcnz who helped me a lot in a discussion documented here.

I’d also like to mention Richard Burtons, whom I didn’t interact with directly; I only read his blog. That guy apparently was too bored by all the amazing tech stuff he was doing for a living, so he took a medical degree and is now working as a doctor, has a passion for horology (meaning, he’s building a turret clock), is sailing regattas with his own RS200, decompiles and reverse-engineers proprietary bootloaders in his spare time and writes a new bootloader called rboot for the ESP8266 as a side project.

EDIT: Jan Almeroth already reversed some of the protocol in 2016 and also documented the communication between the proprietary EWeLink app and the AWS cloud. Unfortunately I only became aware of that great post after I already finished mine.

Introduction Sonoff devices

Quite recently the Itead Sonoff series, a bunch of ESP8266-based IoT home automation devices, was brought to my attention.

The ESP8266 is a low-power consumption SoC especially designed for IoT purposes. It’s sold by Espressif, running a 32-Bit processor featuring the Xtensa instruction set (licensed from Tensilica) and having an ASIC IP core and WiFi onboard.

Those Sonoff devices using this SoC basically expect high voltage input and therefore contain an AC/DC (5V) converter, the ESP8266 SoC and a relay switching the high voltage output.
They’re sold as wall switches (“Sonoff Touch”), E27 socket adapters (“Slampher”), power sockets (“S20 smart socket”) or, as the most basic and cheapest model, all of that in a simple case (“Sonoff Basic”).
They also have a bunch of sensor devices, measuring temperature, power consumption, humidity, noise levels, fine dust, etc.

Though I’m rather sceptical about the whole IoT (development) philosophy, I always was (and still am) interested in low-cost and power-saving home automation which is completely and exclusively under my control.

That implies I’m obviously not interested in some random IoT devices being necessarily connected to some Google/Amazon/Whatever cloud, even less if sensitive data is transmitted without me knowing (but very well suspecting) what it’s used for.

Guess what the Itead Sonoff devices do? Exactly that! They even feature Amazon Alexa and Google Nest support! And of course you have to use their proprietary app to configure and control your devices, via the Amazon cloud.

However, as said earlier, they’re based on the ESP8266 SoC, around which a great deal of OpenSource projects evolved. For some reason especially the Arduino community pounced on that SoC, enabling a much broader range of people to play around with and program for those devices. Whether that’s a good and/or bad thing is surely debatable.

I’ll spare you the details about all the projects I ran into, there’s plenty of cool stuff out there.

I decided to go for the Sonoff-Tasmota project which is quite actively developed and supports most of the currently available Sonoff devices.

It provides an HTTP and MQTT interface and doesn’t need any connection to the internet at all. As MQTT server (called a broker in MQTT terms) I use mosquitto, which I’m running on my OpenWrt WiFi router.

Flashing custom firmware (via serial)

Flashing your custom firmware onto those devices however always requires opening them, soldering a serial cable, pulling GPIO0 down to get the SoC into programming mode (which, depending on the device type, again involves soldering) and then flashing your firmware via serial.

Side note: Why do all those projects describing the flashing procedure name an “FTDI serial converter” as a requirement? Every serial TTL converter does the job.
And apart from the fact that FTDI is not a product but a company, it’s a pretty shady one. I’d just like to remind you of the “incident” where FTDI released new drivers for their chips which intentionally bricked clones of their converters.

Even though firmware replacement via OTA (kinda) works now, you still might want to unbrick or debug your device via serial; the Tasmota wiki provides instructions on how to flash manually for each of the supported devices.

Anyway, as I didn’t want to open and solder every device I intend to use, I took a closer look at the original firmware and its OTA update mechanism.

Protocol analysis

The first thing the device does after being configured (meaning it was set up by the proprietary app and now has internet access via your local WiFi network) is to resolve the hostname `eu-disp.coolkit.cc` and attempt to establish an HTTPS connection.

Though the connection is SSL, the device doesn’t do any server certificate verification, so splitting the SSL connection and man-in-the-middling it is fairly easy.

As a side effect I ported the mitm project sslsplit to OpenWrt and created a separate “interception” network on my WiFi router. Now I only need to join that WiFi network and all SSL connections get split, with their payload logged and provided on an FTP share. Intercepting SSL connections never felt easier.

Back to the protocol: We’re assuming at this point the Sonoff device was already configured (e.g. by the official EWeLink app) which means it has joined our WiFi network, acquired IP settings via DHCP and has access to the internet.

The Sonoff device sends a dispatch call as HTTPS POST request to eu-disp.coolkit.cc including some JSON encoded data about itself:


POST /dispatch/device HTTP/1.1
Host: eu-disp.coolkit.cc
Content-Type: application/json
Content-Length: 152

{
  "accept":     "ws;2",
  "version":    2,
  "ts":         119,
  "deviceid":   "100006XXXX",
  "apikey":     "6083157d-3471-4f4c-8308-XXXXXXXXXXXX",
  "model":      "ITA-GZ1-GL",
  "romVersion": "1.5.5"
}

It expects a host, also JSON encoded, as an answer

HTTP/1.1 200 OK
Server: openresty
Date: Mon, 15 May 2017 01:26:00 GMT
Content-Type: application/json
Content-Length: 55
Connection: keep-alive

{
  "error":  0,
  "reason": "ok",
  "IP":     "52.29.48.55",
  "port":   443
}

which is used to establish a WebSocket connection

GET /api/ws HTTP/1.1
Host: iotgo.iteadstudio.com
Connection: upgrade
Upgrade: websocket
Sec-WebSocket-Key: ITEADTmobiM0x1DaXXXXXX==
Sec-WebSocket-Version: 13


HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: q1/L5gx6qdQ7y3UWgO/TXXXXXXA=

which subsequently will be used for further interchange.
Payload via the established WebSocket channel continues to be encoded in JSON.
The messages coming from the device can be classified into action-requests initiated by the device (which expect acknowledgements by the server) and acknowledgement messages for requests initiated by the server.
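
Seen from the server side, the dispatch step boils down to handing the device an IP and port to open its WebSocket to. Here is a minimal sketch of that answer; the helper name `dispatch_response` is made up for illustration, the keys are the ones from the captured reply above, and the address you pass in would be that of your own server.

```python
import json


def dispatch_response(ws_ip, ws_port=443):
    """Build the JSON body answering POST /dispatch/device.

    Hypothetical helper: the keys mirror the captured reply, the values
    point the device at whatever WebSocket server should take over.
    """
    return json.dumps({
        "error": 0,
        "reason": "ok",
        "IP": ws_ip,
        "port": ws_port,
    })
```

Since the device doesn’t verify the server certificate, whatever host is named here only has to speak SSL, not present a trusted certificate.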

The first requests are action-requests coming from the device:

1) action: register

{
  "userAgent":  "device",
  "apikey":     "6083157d-3471-4f4c-8308-XXXXXXXXXXXX",
  "deviceid":   "100006XXXX",
  "action":     "register",
  "version":    2,
  "romVersion": "1.5.5",
  "model":      "ITA-GZ1-GL",
  "ts":         712
}

responded by the server with
{
  "error":       0,
  "deviceid":   "100006XXXX",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "config": {
    "hb":         1,
    "hbInterval": 145
  }
}

As can be seen, messages from the server side also have an apikey field, which can be any generated UUID other than the one used by the device, as long as it’s used consistently in that WebSocket session.
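
A sketch of how a server could answer the register request under that rule. The function name `register_response` is an assumption for illustration; the heartbeat values are simply copied from the captured response, I don’t know which other values the device would accept.

```python
import uuid


def register_response(request, server_apikey=None):
    """Answer a device's "register" action-request.

    The apikey sent back must differ from the device's own one and has to
    be reused for the rest of the WebSocket session.
    """
    if server_apikey is None:
        server_apikey = str(uuid.uuid4())
        # Extremely unlikely, but never echo the device's own key.
        while server_apikey == request["apikey"]:
            server_apikey = str(uuid.uuid4())
    return {
        "error": 0,
        "deviceid": request["deviceid"],
        "apikey": server_apikey,
        # Heartbeat settings copied from the captured response.
        "config": {"hb": 1, "hbInterval": 145},
    }
```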

2) action: date

{
  "userAgent":  "device",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "deviceid":   "100006XXXX",
  "action":     "date"
}

responded with
{
  "error":      0,
  "deviceid":   "100006XXXX",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "date":       "2017-05-15T01:26:01.498Z"
}

Pay attention to the date format: it is some kind of ISO 8601 but the parser is really picky about it. While Python’s datetime.isoformat() function e.g. returns a string taking microseconds into account, the parser on the device will just fail parsing that. It also always expects the actually optional timezone to be specified as UTC and only as a trailing Z (though according to the spec “+00:00” would be valid as well).
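
One way to produce a string in exactly the shape of the captured example (millisecond precision, trailing Z) is sketched below. The function name `device_timestamp` is made up; it expects a timezone-aware datetime (a naive one would be interpreted as local time by `astimezone`).

```python
from datetime import datetime, timezone


def device_timestamp(now=None):
    """Format a timestamp like "2017-05-15T01:26:01.498Z".

    Milliseconds instead of microseconds, and UTC written as a trailing Z.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    now = now.astimezone(timezone.utc)
    millis = now.microsecond // 1000
    return now.strftime("%Y-%m-%dT%H:%M:%S.") + "%03dZ" % millis
```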

3) action: update
The device tells the server its switch status, the MAC address of the access point it is connected to, signal quality, etc.
This message also appears every time the device status changes, e.g. when it got switched on/off via the app or locally by pressing the button.

{
  "userAgent":      "device",
  "apikey":         "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "deviceid":       "100006XXXX",
  "action":         "update",
  "params": {
    "switch":         "off",
    "fwVersion":      "1.5.5",
    "rssi":           -41,
    "staMac":         "5C:CF:7F:F5:19:F8",
    "startup":        "off"
  }
}

simply acknowledged with
{
  "error":      0,
  "deviceid":   "100006XXXX",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX"
}

4) action: query — the device queries potentially configured timers
{
  "userAgent":  "device",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "deviceid":   "100006XXXX",
  "action":     "query",
  "params": [
    "timers"
  ]
}

as there are no timers configured the answer simply contains a "params":0 KV-pair
{
  "error":      0,
  "deviceid":   "100006XXXX",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "params":     0
}

That’s it – that’s the basic handshake after the (configured) device powers up.

Now the server can tell the device to do stuff.

The sequence number is used by the device to acknowledge particular action-requests so the response can be mapped back to the actual request. It appears to be a UNIX timestamp with millisecond precision which doesn’t seem like the best source for generating a sequence number (duplicates, etc.) but seems to work well enough.

Let’s switch the relay:

{
  "action":     "update",
  "deviceid":   "100006XXXX",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "userAgent":  "app",
  "sequence":   "1494806715179",
  "ts":         0,
  "params": {
    "switch":     "on"
  },
  "from":       "app"
}

{
  "action":     "update",
  "deviceid":   "100006XXXX",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "userAgent":  "app",
  "sequence":   "1494806715193",
  "ts":         0,
  "params": {
    "switch":     "off"
  },
  "from":       "app"
}

As mentioned earlier, each action-request is answered with a proper acknowledgement.
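
The switch requests above could be generated along these lines. This is a sketch only: `switch_request` and `is_ack_for` are made-up names, and the sequence is simply the current UNIX time in milliseconds as a string, matching what the captured messages look like.

```python
import time


def switch_request(deviceid, apikey, state):
    """Build a server-side "update" action-request toggling the relay."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    return {
        "action": "update",
        "deviceid": deviceid,
        "apikey": apikey,
        "userAgent": "app",
        # UNIX timestamp with millisecond precision, sent as a string.
        "sequence": str(int(time.time() * 1000)),
        "ts": 0,
        "params": {"switch": state},
        "from": "app",
    }


def is_ack_for(request, reply):
    """Map a device acknowledgement back to the request via its sequence."""
    return reply.get("error") == 0 and reply.get("sequence") == request["sequence"]
```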

And, finally, the server is now also capable of telling the device to update itself:

{
  "action":     "upgrade",
  "deviceid":   "100006XXXX",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "userAgent":  "app",
  "sequence":   "1494802194654",
  "ts":         0,
  "params": {
    "binList":[
      {
        "downloadUrl":  "http://52.28.103.75:8088/ota/rom/xpiAOwgVUJaRMqFkRBsoI4AVtnozgwp1/user1.1024.new.2.bin",
        "digest":       "1aee969af1daf96f3f120323cd2c167ae1aceefc23052bb0cce790afc18fc634",
        "name":         "user1.bin"
      },
      {
        "downloadUrl":  "http://52.28.103.75:8088/ota/rom/xpiAOwgVUJaRMqFkRBsoI4AVtnozgwp1/user2.1024.new.2.bin",
        "digest":       "6c4e02d5d5e4f74d501de9029c8fa9a7850403eb89e3d8f2ba90386358c59d47",
        "name":         "user2.bin"
      }
    ],
    "model":    "ITA-GZ1-GL",
    "version":  "1.5.5"
  }
}

After successful download and verification of the image’s checksum the device returns:
{
  "error":      0,
  "userAgent":  "device",
  "apikey":     "85036160-aa4a-41f7-85cc-XXXXXXXXXXXX",
  "deviceid":   "100006XXXX",
  "sequence":   "1495932900713"
}

The downloadUrl field should be self-explanatory (the subsequent HTTP GET requests to those URLs contain some more data as CGI parameters, which can however be omitted).

The digest is a SHA-256 hash of the file and the name is the partition the file should be written to.
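
Building one binList entry for your own image is then mostly a matter of hashing the file. A minimal sketch, with `bin_entry` being a made-up helper name and the download URL being wherever you serve the image from:

```python
import hashlib


def bin_entry(name, image, download_url):
    """Build one element of the "binList" array of an upgrade request.

    name:  "user1.bin" or "user2.bin"
    image: the raw image contents as bytes
    """
    return {
        "downloadUrl": download_url,
        # Hex-encoded SHA-256 over the complete file.
        "digest": hashlib.sha256(image).hexdigest(),
        "name": name,
    }
```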

Implementing server side

After some early approaches I decided to go for a Python implementation using the tornado webserver stack.
This decision was mainly based on it providing functionality for HTTP (obviously) as well as websockets and asynchronous handling of requests.

The final script can be found here: https://github.com/mirko/SonOTA

==> Trial & Error

1st attempt

As user1.1024.new.2.bin and user2.1024.new.2.bin look almost the same, let’s just use the same image for both, in this case a Tasmota build:

MOEP! Boot fails.

Reason: The Tasmota build also contains the bootloader, which the Espressif OTA mechanism doesn’t expect to be in the image.

2nd attempt

Chopping off the first 0x1000 bytes which contain the bootloader plus padding (filled up with 0xAA bytes).

MOEP! Boot fails.

Boot mode 1 and 2 / v1 and v2 image headers

The (now chopped) image and the original upgrade images appear to have different headers; even the very first byte (the files’ magic byte) differs.

The original image starts with 0xEA while the Tasmota build starts with 0xE9.

Apparently there are two image formats (called v1 and v2 or boot mode 1 and boot mode 2).
The former (older) one — used by Arduino/Tasmota — starts with 0xE9, while the latter (and apparently newer one) — used by the original firmware — starts with 0xEA.

The technical differences are very well documented by the ESP8266 Reverse Engineering Wiki project; for the flash format and the v1/v2 headers, see in particular the SPI Flash Format wiki page.

The original bootloader only accepts images starting with 0xEA while the bootloader provided by Arduino/Tasmota only accepts such starting with 0xE9.
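
Checking which kind of image you’re holding is a one-byte affair. A small sketch (the function name `image_format` is made up) using only the two magic bytes mentioned above:

```python
V1_MAGIC = 0xE9  # used by Arduino/Tasmota builds
V2_MAGIC = 0xEA  # used by the original firmware


def image_format(image):
    """Classify an image by its first (magic) byte."""
    if not image:
        raise ValueError("empty image")
    first = image[0]
    if first == V1_MAGIC:
        return "v1"
    if first == V2_MAGIC:
        return "v2"
    return "unknown"
```

Running this over an image before offering it for download would have caught the failure of the 2nd attempt early.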

3rd attempt

Converting Arduino images to v2 images

Easier said than done, as the Arduino framework doesn’t seem to be capable of creating v2 images and none of the common tools appear to have conversion functionality.

Taking a closer look at the esptool.py project however, there seems to be (undocumented) functionality.
esptool.py has the elf2image argument which, according to the source, allows switching between conversion to v1 and v2 images.

When using elf2image and also passing the --version parameter (which normally prints out the version string of the tool), the --version parameter gets redefined and then expects an argument: 1 or 2.

Besides the sonoff.ino.bin file the Tasmota project also creates a sonoff.ino.elf which can now be used in conjunction with esptool.py and the elf2image argument to create v2 images.

Example: esptool.py elf2image --version 2 tmp/arduino_build_XXXXXX/sonoff.ino.elf

WORKS! MOEP! WORKS! MOEP!

Remember the upgrade-action passed a 2-element list of download URLs to the device, having different names (user1.bin and user2.bin)?

This procedure now only works if the user1.bin image is being fetched and flashed.

Differences between user1.bin and user2.bin

The flash on the Sonoff devices is split into 2 parts (simplified!) which basically contain the same data (user1 and user2). As OTA upgrades are proven to fail sometimes for whatever reason, the upgrade will always happen on the currently inactive part, meaning, if the device is currently running the code from the user1 part, the upgrade will happen onto the user2 part.
That mechanism is not invented by Itead, but actually provided as off-the-shelf OTA solution by Espressif (the SoC manufacturer) itself.

For 1MB flash chips the user1 image is stored at offset 0x01000 while the user2 image is stored at 0x81000.

And indeed, the two original upgrade images (user1 and user2) differ significantly.

If flashing a user2 image onto the user1 part of the flash the device refuses to boot and vice versa.

While there’s not much information about how user1.bin and user2.bin technically differ from each other, khcnz pointed me to an Espressif document stating:

user1.bin and user2.bin are [the] same software placed to different regions of [the] flash. The only difference is [the] address mapping on flash.

4th attempt

So apparently those 2 images must be created differently indeed.

Again it was khcnz who pointed me to different linker scripts used for each image within the original SDK.
Diffing
https://github.com/espressif/ESP8266_RTOS_SDK/blob/master/ld/eagle.app.v6.new.1024.app1.ld
and
https://github.com/espressif/ESP8266_RTOS_SDK/blob/master/ld/eagle.app.v6.new.1024.app2.ld
reveals that the irom0_0_seg differs (org = 0x40201010 vs. org = 0x40281010).

As Tasmota doesn’t make use of the user1-/user2-ping-pong mechanism it only creates images supposed to go to 0x1000 (=user1-partition).

So for creating a user2.bin image (in our case for a device having a 1MB flash chip and allocating only 64K for SPIFFS) we have to modify the following linker script accordingly:

--- a/~/.arduino15/packages/esp8266/hardware/esp8266/2.3.0/tools/sdk/ld/eagle.flash.1m64.ld
+++ b/~/.arduino15/packages/esp8266/hardware/esp8266/2.3.0/tools/sdk/ld/eagle.flash.1m64.ld
@@ -7,7 +7,7 @@ MEMORY
   dport0_0_seg :                        org = 0x3FF00000, len = 0x10
   dram0_0_seg :                         org = 0x3FFE8000, len = 0x14000
   iram1_0_seg :                         org = 0x40100000, len = 0x8000
-  irom0_0_seg :                         org = 0x40201010, len = 0xf9ff0
+  irom0_0_seg :                         org = 0x40281010, len = 0xf9ff0
 }
 
 PROVIDE ( _SPIFFS_start = 0x402FB000 );
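
The two origins in the diff follow directly from where each slot lives in flash. A small sketch of that arithmetic, assuming SPI flash is mapped into the address space starting at 0x40200000; the extra 0x10 bytes are inferred from the numbers in the linker script, not taken from any documentation, and `irom_origin` is a made-up name.

```python
FLASH_MAP_BASE = 0x40200000  # start of the memory-mapped SPI flash
HEADER_SKIP = 0x10           # assumption: inferred from 0x40201010 - 0x40201000

USER1_OFFSET = 0x01000       # user1 slot on a 1MB flash chip
USER2_OFFSET = 0x81000       # user2 slot on a 1MB flash chip


def irom_origin(slot_offset):
    """Address the irom0_0_seg has to be linked at for a given flash slot."""
    return FLASH_MAP_BASE + slot_offset + HEADER_SKIP
```

`hex(irom_origin(USER1_OFFSET))` and `hex(irom_origin(USER2_OFFSET))` give the old and the new org value of the diff, respectively.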

So we will now create a user1 image (without the above modification applied) and a user2 image (with the above modification) and convert them to v2 images with esptool.py as described above.

–> WORKS!

Depending on whether the original firmware was loaded from the user1 or user2 partition, it will fetch and flash the other image, telling the bootloader afterwards to change the active partition.

Issues

Mission accomplished? Not just yet…

Although our custom firmware is now flashed via the original OTA mechanism and running, the final setup differs in 2 major aspects (compared to if we would have flashed the device via serial):

Each point alone already results in the Tasmota/Arduino OTA mechanism not working.
Additionally, since the bootloader stays the original one, it still only expects v2 images and still messes with us with its ping-pong mechanism.

This issue is already being addressed though, and how best to solve it is being discussed in the issue ticket mentioned at the very beginning.

Happy hacking!

intel 540s SSD fail

My intel SSD failed. Hard. As in: its content got wiped. But before getting way too theatrical, let’s stick to the facts first.

I upgraded my Lenovo ThinkPad X1 Carbon with a bigger SSD in the late summer this year — a 1TB intel 540s (M.2).

The BIOS of ThinkPads (and probably other brands as well) offers to secure your drive with an ATA password. This feature is part of the ATA specification and was already implemented and used back in the old IDE times (remember the X-BOX 1?).

With such an ATA password set, all read/write commands to the drive will be ignored until the drive gets unlocked. There’s some discussion about whether ATA passwords should or shouldn’t be used — personally I like the idea of $person not being able to just pull out my drive, modify its unencrypted boot record and put it back into my computer without me noticing.

With regard to current SSDs, the ATA password doesn’t just lock access to the drive but also plays a part in the FDE (full disk encryption) featured by modern SSDs. But back to what actually happened…

As people say, it’s good practice to frequently(TM) change passwords. So I did with my ATA password.

And then it happened. My data was gone. All of it. I could still access the SSD with the newly set password but it only contained random data. Even the first couple of KB, which were supposed to contain the partition table as well as unencrypted boot code, magically seem to have been replaced with random data. Perfectly random data.

So, what happened? Back to FDE of recent SSDs: They perform encryption on data written to the drive (decryption on reads, respectively) — no matter if you want it or not.
Encrypted with a key stored on the device — with no easy way of reading it out (hence no backup). This is happening totally transparently; the computer the device is connected to doesn’t have to care about that at all.

And the ATA password is used to encrypt the key the actual data on the drive is encrypted with. Password encrypts key encrypts data.
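
To make that chain a bit more tangible, here is a toy sketch. It is not how any real SSD controller works internally (that is not documented here), all names are made up, and the XOR “wrapping” is only a stand-in for a real key-wrap cipher. It just shows why changing the password should be harmless as long as the same data key gets re-wrapped.

```python
import hashlib


def derive_kek(password, salt):
    """Derive a 32-byte key-encryption key from the password (toy parameters)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 1000)


def xor(a, b):
    """Stand-in for a real key-wrap cipher; NOT secure, illustration only."""
    return bytes(x ^ y for x, y in zip(a, b))


def wrap(data_key, password, salt):
    """Password encrypts key."""
    return xor(data_key, derive_kek(password, salt))


def unwrap(wrapped, password, salt):
    """Recover the data key; a wrong password yields garbage."""
    return xor(wrapped, derive_kek(password, salt))


def change_password(wrapped, old, new, salt):
    """Re-wrap the very same data key, so the data stays readable."""
    return wrap(unwrap(wrapped, old, salt), new, salt)
```

If a password change instead ends up with a different data key, everything on the drive decrypts to random-looking garbage, which is what I was looking at.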

Back to my case: No data, just garbage. Perfectly random garbage. First idea on what happened, as obvious as devastating: the data on the drive gets read and decrypted with a different key than it initially got written and encrypted with. If that’s indeed the case, my data is gone.

This behaviour is actually advertised as a feature. intel calls it “Secure Erase“. No need to overwrite your drive dozens of times like in the old days to ensure the data has irreversibly vanished in the end. No, just wipe the key your data is encrypted with and done. And exactly this seems to have happened to me. I am done.

Fortunately I made backups. Some time ago. Quite some time ago. Of a few directories. Very few. Swearing. Tears. I know, I know, I don’t deserve your sympathies (but I’d still appreciate!).

Anger! Whose fault is it?! Who to blame?!

Let’s check the docs on ATA passwords, which appear to be very clear — from the official Lenovo FAQ:

“Will changing the Master or User hard drive password change the FDE key?”
– “No. The hard drive passwords have no effect on the encryption key. The passwords can safely be changed without risking loss of data.”

Not my fault! Yes! Wait, another FAQ entry says:

“Can the encryption key be changed?”
– “The encryption key can be regenerated within the BIOS, however, doing so will make all data inaccessible, effectively wiping the drive. To generate a new key, use the option listed under Security -> Disk Encryption HDD in the system BIOS.”

Double-checking the BIOS to see whether I unintentionally told it to change the FDE key. No, I wasn’t even able to find such a setting.

Okay — intermediate result: either buggy BIOS telling my SSD to (re)generate the encryption key (and therewith “Secure Erase” everything on it) or buggy SSD controller, deciding to alter the key at will.

Google! Nothing. Frightening reports about the disastrous “8MB”-bug on the earlier series 320 devices popped up. But nothing on series 540s.

If nothing helps and/or there’s nobody to blame: go on Twitter!

Some Ping-Pong:

Then…

Wait, what?! That’s a known issue? I didn’t find a damn thing in the whole internets! Tell me more!

And to my surprise – they did. For a minute. Shortly before having the respective tweets deleted.

Let’s take a look at what my phone cached:

The deleted tweets contain a link http://intel.ly/2eRl73j which resolves to https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00055&languageid=en-fr which is an advisory seemingly describing exactly what happened to me:

“In systems with the SATA devsleep feature enabled, setting or resetting the user or master password by the ATA security feature set may cause data corruption.”

Later on:

“Intel became aware of this issue during early customer validation.”

I guess I just became aware of being part of the “early customer validation”-program. This issue: Personally validated. Check.

Ok, short recap:

Meanwhile, I could try to follow up on @lenovo’s tips:

Sounds good! Maybe, just maybe, that could bring my data back.

Let’s skip the second link, as it contains a dedicated Windows software I’d love to run, but my Windows installation just got wiped (and I’m not really keen on reinstalling and thereby overwriting my precious maybe-still-not-yet-permanently-lost data).

The first link points to an ISO file. Works for me! Until it crashes. Reproducibly. This ISO reproducibly crashes my Lenovo X1 Carbon 3rd generation. Booting from USB thumb-drive (officially supported it says), as well as from CD. Hm.

For now I seem to have to conclude with the following questions:

 

PS: Before I clicked the Publish button I again set up a few search queries. Found my tweets.


Flash memory

Flash in embedded devices is a really hot topic – there are quite some questions which should be asked before choosing the “right” flash memory type. Often the wrong decision turns out to backfire – I experienced this in several projects and companies.
 
Questions which should get asked for evaluating possible flash chips and types are:
 
Those questions should be dealt with quite carefully when evaluating the right type of flash for a project.
 
Since most of the above questions don’t have simple answers but affect each other, and of course each solution has its very own advantages and disadvantages, I’ll try to illustrate some scenarios instead.
 

Erase blocks and erase cycles

 
Flash storage consists of so-called “erase blocks” (just called blocks from now on). Their size depends heavily on the kind of flash (NOR/NAND) and the total size of the flash.
 
Usually NOR flash has much larger blocks than NAND flash relative to its total size – typical block sizes are e.g. 64KB for a 4MB NOR flash, 64KB for a 256MB NAND flash, 128KB for a 512MB NAND flash.
As flash (especially NAND flash) grew bigger and bigger in storage size, pages and later sub-pages were introduced.
 
NAND flash consists of erase blocks which might consist of pages which might consist of sub-pages.
 
Though pages and sub-pages can be directly addressed for reading, only whole erase-blocks can be – guess what? – erased.
 
NAND flash also may contain an ‘out of band (OOB) area’ which usually is a fraction of the block or – if any – page size. This is dedicated for meta information (like information about bad blocks, ECC data, erase counters, etc.) and not supposed to be used for your actual data payload.
Flash storage needs to be addressed ‘by block’ for writing. Blocks can only be erased a certain number of times before they become corrupted and unusable (10,000 – 100,000 times are typical values).
 
The usual conclusion from the above is that flash storage can be read but not written byte-wise: to write to flash, you need to erase the whole block first. This is not totally wrong, but misleading because it is oversimplified.
Flash storage cells by default have the state “erased” (which matches the logical bit ‘1’). Once a bit got flipped to ‘0’ you can only get it back to ‘1’ by erasing the entire block.
 
Even though you can indeed only address the flash “per erase block” for writing – and you have to erase (and thereby rewrite) the whole erase block when you intend to flip a bit from ‘0’ to ‘1’ – that doesn’t mean every write operation needs a prior block erase.
Bits can be flipped from ‘1’ to ‘0’ individually, but only an entire block can be erased in order to get bits within it back to ‘1’.
 
Suppose an (in this example unrealistically small) erase block contains ‘1111 1110’ and you want to change it to, let’s say, ‘1110 1111’. You have to:
1) erase the whole block, so it will be ‘1111 1111’
2) flip the 4th bit down to ‘0’
But if we e.g. want to turn ‘1111 1110’ into ‘1010 0110’ we just flip the 2nd, 4th and 5th bit of the block. That way we don’t need to erase the whole block first, because no bit within the block needs to be changed from ‘0’ to ‘1’.
So with clever write handling, not every write operation within an erase block implies erasing the block first.
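
The rule from the example fits into one line of bit arithmetic: an erase is only needed if some bit that is currently ‘0’ has to become ‘1’. A minimal sketch (the function name `needs_erase` is made up, blocks are modelled as plain integers):

```python
def needs_erase(current, wanted):
    """True if writing `wanted` over `current` requires a block erase.

    Programming can only clear bits (1 -> 0). Any bit that is 0 now but
    has to be 1 afterwards forces an erase of the whole block first.
    """
    return (~current & wanted) != 0
```

With the two cases from above, `needs_erase(0b11111110, 0b11101111)` is true while `needs_erase(0b11111110, 0b10100110)` is false.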
 
Taking this into account might significantly enhance the lifetime of your flash (especially SLC NAND, MLC requires some even more sophisticated methods), as blocks can only be erased a certain number of times.
It might also allow you to use cheaper flash (with fewer guaranteed erase cycles).
 
Also you should make sure that you don’t keep a majority of erase blocks untouched while others get erased thousands of times.
To avoid this, you usually use some kind of ‘wear leveling’. That means you keep track of how many times blocks got erased, and – if possible – relocate data which gets changed often to blocks which didn’t get erased that often.
To get wear leveling done properly, you need to have a certain number of spare blocks to be able to rotate the blocks and relocate the actual data properly.
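
In its most naive form, wear leveling is just “always take the spare block with the lowest erase counter”. The following is a toy sketch of that idea with made-up names; real implementations are a lot more sophisticated (static data also has to be moved around, for instance).

```python
def pick_block(erase_counts, free_blocks):
    """Return the free block which has been erased the fewest times."""
    if not free_blocks:
        raise ValueError("no spare blocks left to rotate")
    return min(free_blocks, key=lambda block: erase_counts[block])


def erase_and_take(erase_counts, free_blocks):
    """Erase the least worn free block and hand it out for new data."""
    block = pick_block(erase_counts, free_blocks)
    erase_counts[block] += 1      # keep track of the wear
    free_blocks.remove(block)     # block is in use now
    return block
```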

Hardware vs. Software

Both techniques can significantly improve the lifetime of your flash, however require quite some sophisticated algorithms to get a reasonable advantage over not using them.
Bare flash doesn’t deal with those issues. There is controller hardware taking care of that (e.g. in higher-class memory cards / USB sticks), but it usually sucks. If possible, do it in software.
Using e.g. Linux to drive your flash offers quite some possibilities here (looking at file-systems).

NAND vs. NOR

NOR blocks compared to those of NAND, in relation to the total storage size of the flash, are quite big – which means:

Bad blocks

Bad blocks and how they’re dealt with decide whether you can use your flash as normal storage in the end or end up in debugging nightmares, caused by weird device failures which might result from improper bad block handling.
 
A bad block is an erase block which shouldn’t be used anymore because it doesn’t always store data as intended due to bit flips. This means parts of the block can’t be erased (flipped back to ‘1’) anymore, parts “float” and can’t be brought into a well-defined state anymore, etc. Blocks can only be erased a certain number of times until they get corrupted and thereby ‘bad’. Once you encounter a probably bad block, it should be marked as bad immediately and never ever be used again.
 
Marking a bad block means adding it to a table of bad blocks (which is mostly located at the end of the flash). Since this table sits within blocks on the flash which might go bad as well, it is usually stored redundantly.
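
What software does with such a table can be sketched as a mapping from logical to physical block numbers that simply skips everything marked bad. This is a toy model with made-up names; how the table is actually stored on the flash is a different story.

```python
def build_block_map(total_blocks, bad_blocks):
    """Map logical block numbers onto physical blocks, skipping bad ones.

    The index into the returned list is the logical block number, the
    value is the physical block backing it.
    """
    bad = set(bad_blocks)
    return [block for block in range(total_blocks) if block not in bad]


def mark_bad(bad_blocks, block):
    """Add a block to the bad block table; entries are never removed."""
    bad_blocks.add(block)
    return bad_blocks
```

Note that every additional bad block shortens the map by one entry, which is why you need spare space in your calculation.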
 
Bad blocks happen, especially on NAND flash. Even never-used NAND flash, right from the factory, might contain bad blocks. Because of that, NAND flash manufacturers ship their flash with “pre-installed” information about which blocks are bad from the beginning. Unfortunately, how this information is stored on the flash is vendor / product specific. Another common area to store this kind of information is the OOB area (if existing) of the flash.
 
This also means that flash chips of the very same vendor and product will have different amounts of usable blocks. This is a fact you have to deal with – don’t be too tight in your calculation, you’re going to need spare space as replacement for bad blocks!

Hardware vs. Software

There are actually NAND chips available doing bad block management on their own. Although I never used them myself, from what I read it sounds quite nice. Most bare flash however doesn’t deal with bad blocks. There might be pre-installed information about bad blocks from the beginning, there might not be. And if there is, the format is whatever the vendor chose it to be.
That means, if you use a flash controller dealing with bad blocks for you, it needs to be able to read and write the format this information is stored in. It needs to be aware of whether the flash has OOB areas or not and how they’re organized.
 
There is quite some sophisticated and flexible flash controller hardware out there, but again: if possible, do it in software. There is quite good and proven code for that available.

NAND vs. NOR

NOR flash is way more predictable than NAND flash. NAND flash blocks might get bad whenever they want (humidity, temperature, whatever..).
Reading NAND flash stresses it as well – yes, reading NAND flash causes bad blocks! Although this doesn’t happen as often as due to write operations (and far less on SLC than on MLC flash), it happens and you better deal with it!
The number of expected erase cycles is mentioned in the data-sheets of NAND flashes. The number of read cycles isn’t. However it’s usually something around 10 to 100 times the erase cycles.
Imagine you want to boot from your NAND flash, and the boot-loader – which doesn’t deal with bad blocks yet – sits within blocks which behave unpredictably / become bad.. a nightmare!
That’s why it’s common now, that the first few blocks of NAND flash are guaranteed to be safe for the first N erase cycles. Make sure your boot-loader fits into those safe blocks!

SwitchSmart!

The “radio controlled power sockets”-project finally got its very own name and project site:

SwitchSmart!

Several improvements happened since my last post about this project:

PostgreSQL – altering table name does not update references to its primary key and sequence automatically

While setting up multiple PostgreSQL instances, replicating content via the replication framework Slony-I, I had to manually create the very same SQL schema on every Postgres node – as Slony just replicates the payload data and not the actual SQL schemas.

I was creating tables like this on every node:

CREATE TABLE x (
id             SERIAL PRIMARY KEY,
content   VARCHAR(255) DEFAULT NULL
);

but, after having already configured half of the nodes, I decided to rename the table from ‘x’ to ‘y’ using the ‘ALTER TABLE’ command

ALTER TABLE x RENAME TO y;

and continued creating the schemas on the remaining nodes with the following command:

CREATE TABLE y (
id             SERIAL PRIMARY KEY,
content   VARCHAR(255) DEFAULT NULL
);

After finally having provided the schema to all nodes I started the replication daemons, and half of the nodes threw errors saying that replication doesn’t work properly since the schema doesn’t match the one on the master replication server:

CEST ERROR  remoteWorkerThread_1: "select "_db".setAddTable_int(1, 3, '"public"."y"', 'x_pkey', 'Table public.y with primary key'); " PGRES_FATAL_ERROR ERROR:  Slony-I: setAddTable_int(): table "public"."y" has no index x_pkey

All of the non-working nodes were those on which I first created the table x and later renamed it to y, instead of directly creating table y like I did on the others.

Looking at the global table definitions – including the automatically co-created sequence – you can see that the table did get renamed, but the sequence didn’t:

Table y directly created:

postgres=# \d
List of relations
Schema |      Name      |   Type   |  Owner
-------+----------------+----------+----------
public | y              | table    | postgres
public | y_id_seq       | sequence | postgres
(4 rows)

Table x created and altered to be named y:

postgres=# \d
List of relations
Schema |      Name      |   Type   |  Owner
-------+----------------+----------+----------
public | y              | table    | postgres
public | x_id_seq       | sequence | postgres
(4 rows)

That normally doesn’t cause any trouble, since the reference of table y (formerly x) to the sequence x_id_seq is still valid. Since replication requires the exact same schema on every node, though, it does cause trouble here. Still, that’s not the error mentioned in the error message above, which is referring to the primary key.

Diff’ing the actual schemas shows up more differences:

                                  Table "public.y"
Column  |          Type          |                   Modifiers
--------+------------------------+------------------------------------------------
- id      | integer                | not null default nextval('x_id_seq'::regclass)
+ id      | integer                | not null default nextval('y_id_seq'::regclass)
content | character varying(255) | default NULL::character varying
Indexes:
-    "x_pkey" PRIMARY KEY, btree (id)
+    "y_pkey" PRIMARY KEY, btree (id)

The name of the sequence and the name of the primary key index were NOT updated to match again when the table got renamed. This separation of table name and references might be a feature, however I find it hard to imagine a use-case where it makes sense to use the sequence and/or primary key of another table. UPDATE: I just got told that it indeed might make sense to share one sequence among several tables.

Also, sequence and primary key were created inside / co-created by the ‘CREATE TABLE’ statement, so I’d at least find it more consistent if both were always tied to the table they originally got created with.

Looking for information, hints or documentation about this behaviour wasn’t fruitful either.

So personally I’d really like to see those references updated automatically when changing the table’s name. At the very least I’d like to see a NOTICE that primary key and sequence still carry the old name and need to be updated / re-created to match again.

Conclusion: Make sure – when altering table names in Postgres – references to primary key and sequence are getting updated as well – manually! Primary key and sequence are NOT tied together with the table they got created with!
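
For reference, the manual fix-up can be sketched as a small Python helper which only builds the SQL statements. The function name rename_table_statements is made up for this post, and the helper assumes the default names PostgreSQL generated above (x_id_seq for the SERIAL column id, x_pkey for the primary key) as well as plain lowercase identifiers which need no quoting.

```python
def rename_table_statements(old, new, serial_column="id"):
    """Build the SQL to rename a table together with its co-created objects.

    Assumes PostgreSQL's default naming: <table>_<column>_seq for the
    sequence of a SERIAL column and <table>_pkey for the primary key index.
    """
    return [
        f"ALTER TABLE {old} RENAME TO {new};",
        f"ALTER SEQUENCE {old}_{serial_column}_seq "
        f"RENAME TO {new}_{serial_column}_seq;",
        f"ALTER INDEX {old}_pkey RENAME TO {new}_pkey;",
    ]


if __name__ == "__main__":
    # Print the statements so they can be reviewed and pasted into psql.
    for statement in rename_table_statements("x", "y"):
        print(statement)
```

The statements are only printed, not executed, so they can be checked before running them on every affected node.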

LinuxInput devices (in Qt embedded) – take care!

I’m currently working on an embedded project based on Linux which involves a graphical user interface based on Qt.

In our case Qt accesses framebuffer as well as the LinuxInput devices directly – there is no further layer (DirectFB, Xorg, Wayland, etc.) in between.

In this quite common scenario I noticed that Qt is treating the LinuxInput devices in a quite counterintuitive way which may lead to severe security problems.
The embedded device we are using in this particular project has a keyboard which is exposed to userland as a LinuxInput device.

Everything worked quite well, however our Qt application crashed once due to a programming error (our fault) and I was shown the underlying UNIX console which reflected every single keystroke I did inside the Qt application – all typing input within the Qt application also got passed to other applications / underlying shells!

A root shell exposed on the active TTY – possibly hidden by the GUI – therefore also captures all input. Since almost all (embedded) Linux distributions expose login prompts / shells on all TTYs by default, this scenario is far from unlikely.

My expected behavior would have been: the Qt application is opening the LinuxInput devices exclusively, so that only the Qt application receives data from the input devices.

After some research and asking around I got pointed to the EVIOCGRAB IOCTL implemented in the LinuxInput subsystem.

As stated in $(LINUX_KERNEL)/include/linux/input.h:

@grab: input handle that currently has the device grabbed (via
EVIOCGRAB ioctl). When a handle grabs a device it becomes sole
recipient for all input events coming from the device

This means: When opening LinuxInput devices they’re not grabbed, they’re not opened exclusively by default.

Okay, I got it 🙂

So let’s just find the magic option / command which tells Qt to set up an EVIOCGRAB with an appropriate argument on the going-to-be-used LinuxInput device. Since I didn’t find any documentation I grep’ed through the Qt source:

> ~/src/qt.git$ git grep EVIOCGRAB
> ~/src/qt.git$

nothing… so I took a closer look into how Qt implemented support for LinuxInput devices (src/gui/embedded/qkbdlinuxinput_qws.cpp) – still nothing…

So I implemented support for telling Qt to open LinuxInput devices exclusively by optionally passing a ‘grab’-argument to the LinuxInput drivers inside Qt.

And here it is – a patch which adds support for passing another parameter named ‘grab’ to the LinuxInput device driver of Qt – to specify whether the device should be opened exclusively (grabbed) or not.

https://dev.openwrt.org/browser/packages/Xorg/lib/qt4/patches/500-allow-device-grabbing.patch?rev=26780

Conclusion: Take care when using Qt and its LinuxInput drivers – all input might be received and used by other applications as well – including shells running on TTYs.

 

UPDATE: The patch finally got committed and went upstream. Grabbing can be configured since Qt version 4.8 (http://www.qt.gitorious.org/qt/qt/commit/947aaa79b05adec527c7500e36766c7ff19f118d/diffs)

RFM12 – kernel patches

Since I got asked several times about the pin mappings and wirings between the rfm12 modules and GPIOs of the devices providing them (in my case the Netgear WGT634U router / the Qi NanoNote) I’d like to try making some things clearer:

In Linux the use of buses is abstracted as far as possible – the idea is that the actual boards (or call them devices, platforms etc., whatever you like) define and expose the buses they provide and their properties.

In our case the rfm12 kernel module requires the availability of an SPI bus – it doesn’t matter how it is implemented (native SPI bus, SPI over GPIOs or any other way of implementation). The client module – the kernel module implementing a driver for the actual rfm12 hardware – simply doesn’t care and doesn’t need to know about that.

That’s the reason why the actual rfm12 code doesn’t contain any GPIO <-> rfm12 hardware mappings – the rfm12 code is just using an existing and otherwise exposed SPI bus.

The actual wiring / mapping and setup of the SPI bus is done within the platform / board / device code, which is located below arch/${ARCH}/${BOARD} – a common place for code like that is arch/${ARCH}/${BOARD}/setup.c.

This project however seemed to have raised some interest and I got asked quite a few times about the board specific changes I made – so here they are now:
the kernel patches which provide SPI buses on both boards, including GPIO mappings.

Since both targets I used the rfm12 module and driver on are running OpenWrt, I created both patches against the Linux vanilla tree having OpenWrt specific kernel patches already applied.

Changes however are small and clear, so they should be easy to understand and adapt.

The wiring I used to get the rfm12 module working on the NanoNote is, by the way, the following:


GPIO | PIN on SD port | PIN on module | purpose / description
=====|================|===============|=================================
108  | D12 (1)        | MISO / SDO    | SPI: master input slave output
109  | D13 (2)        | nIRQ          | interrupt
104  | D08 (3)        | (unused)      | (unused)
X    | VDD (4)        | VDD           | power
105  | D09 (5)        | MOSI / SDI    | SPI: master output slave input
X    | VSS (6)        | GND (1+2)     | ground
106  | D10 (7)        | SCK           | SPI: clock
107  | D11 (8)        | nSEL          | SPI: chip select

There needs to be a resistor (10-100kOhm) between pin FSK (if used) of
the rfm12 module and VDD as pullup – however when just using ASK it isn’t needed anyway.

GDB behaves strange while debugging threads

While debugging issues involving binaries on a system running Linux, having a debugger such as GDB available is quite helpful.

However while working on a certain project we recently experienced quite some issues debugging applications involving threads.

Debugging the application on my local workstation worked quite fine, however on OpenWrt targets – ARM as well as MIPS – it behaved rather strangely: stack corruptions, missing traces, weird signals being issued…

After quite some time of debugging the debug issue, we found out the issue is caused by a stripped version of libpthread.so.

Stripped – not in the sense of a more lightweight but compatible version of the pthread library – but stripped by the utility “strip”, which purges all debug- and “other unneeded” symbols out of binaries to reduce their size, which usually is applied on all binaries by the OpenWrt framework automatically.

Usually binaries stripped by “strip” are still fully-fledged binaries, still usable with GDB (however without debugging symbols available, of course). When applied to libpthread.so*, however, strip also seems to remove information GDB needs for following and tracing threads. This meta-information isn’t needed for running the actual application, only for tracking its threads, and without it GDB shows the issues mentioned above.

One might ask why someone is debugging binaries without debug symbols compiled in – reasons are obvious:

To check whether an object got stripped or not is quite easy using the “file” util:

$ file build_dir/target-arm_v5te_uClibc-0.9.30.1_eabi/root-foo/lib/libpthread-0.9.30.1.so
build_dir/target-arm_v5te_uClibc-0.9.30.1_eabi/root-foo/lib/libpthread-0.9.30.1.so: ELF 32-bit LSB shared object, ARM, version 1 (SYSV), dynamically linked (uses shared libs), stripped

$ file staging_dir/target-arm_v5te_uClibc-0.9.30.1_eabi/root-foo/lib/libpthread-0.9.30.1.so
staging_dir/target-arm_v5te_uClibc-0.9.30.1_eabi/root-foo/lib/libpthread-0.9.30.1.so: ELF 32-bit LSB shared object, ARM, version 1 (SYSV), dynamically linked (uses shared libs), not stripped
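
If you want to run this check over a whole root filesystem, the output of “file” can be evaluated by a small script. The following Python sketch relies only on the trailing “stripped” / “not stripped” field shown above; the function names is_stripped and check_path are made up, and it assumes the “file” utility is installed.

```python
import subprocess


def is_stripped(file_output):
    """Evaluate the last comma-separated field of a line printed by file(1)."""
    last_field = file_output.rsplit(",", 1)[-1].strip()
    if last_field == "not stripped":
        return False
    if last_field == "stripped":
        return True
    raise ValueError("no strip information found: %r" % file_output)


def check_path(path):
    """Run file(1) on a path and report whether the object got stripped."""
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True, text=True, check=True,
    )
    return is_stripped(result.stdout)
```

Calling check_path() on the two libpthread copies from above should return True for the one in build_dir and False for the one in staging_dir.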

Long story short: When debugging applications involving threads, always use a non-stripped version of libpthread.so, even if debug symbols are not needed!

Ben NanoNote able to control radio power sockets

The Ben NanoNote is now able to switch radio controlled power sockets, too!

The rfm12 433MHz module, produced by HopeRF, is attached to the microSD port of the Ben NanoNote, whose pins are exposed via a microSD card dummy adapter.

rfm12 module attached to a microSD dummy

The system-on-a-chip used inside the Ben NanoNote (Ingenic JZ47XX) allows us to put the microSD pins into GPIO mode, so we can create and export an SPI bus on top of them to communicate with the module (as already done for the first device I connected an rfm12 module to, the Netgear WGT634U router).

That way there’s no need to open the device and/or solder anything anywhere – the module is attached directly to the microSD dummy which simply gets inserted into the microSD slot of the NanoNote.

rfm12 module attached to NanoNote running ncurses UI for switching power sockets

I wrote a basic but fully working ncurses frontend using the rfm12 library listing configured devices waiting for getting switched.

There’s also a ready-to-use XMLRPC daemon available which exposes all configured devices and provides functions for controlling them, so UIs for controlling devices are not limited to run on the same system the rfm12 module is installed on.

ncurses UI for switching configured radio controlled power sockets on the Ben NanoNote

I’ve also written a little GUI in python using the qt4 toolkit, connecting to the master via XML-RPC:

switching radio controlled power via a GUI using pyqt4

Several other frontends are work-in-progress as e.g. a GUI for the Android platform, as well as one based on qt4/QML being able to run e.g. on phones running Meego/Maemo as operating system – both using the XML-RPC interface.

So all major parts of the project are mostly finished now and the API is more or less fixed.

The whole project now consists of (a rough overview):

  1. the kernel module which communicates with the actual hardware (so the rfm12 module) and exposes a character device to userspace
  2. the rfm12 library which
    • connects to the kernel module
    • contains the device type specific data and code to modulate signals which control the actual power sockets
    • provides functions for reading / writing configuration files and controlling / switching devices
  3. applications using the library and its functions for actual controlling of devices, which could be / already are:
    • UI applications linked directly against librfm12 (an ncurses frontend (shown above) is already available)
    • daemons providing network interfaces (a daemon exporting functionality via XML-RPC is already available, one doing the same but using JSON-RPC as underlying RPC method is going to be implemented soon)
  4. UIs running on different machines, using these RPC services  (e.g. above pyqt4  frontend; android- and qml-frontends are work-in-progress)

Devices are getting configured via configuration files, describing the product type of the device, a name and the actual code which is used to identify devices of a certain product type group.

The config used by the applications shown on the photos / screenshots above just looks like this:

[socket_A]
label = "one"
product = "P801B"
code = "1111110000"
[socket_B]
label = "two"
product = "2272"
code = "1111110000"
[socket_C]
label = "three"
product = "2272"
code = "1111101000"
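
Since the format is plain INI style, such a config can also be read by other tools. As a rough sketch (this is not how librfm12 itself parses the file), Python’s configparser module is enough; the function name load_devices and the shortened sample are only for illustration.

```python
import configparser

# Shortened sample in the format shown above.
SAMPLE = """\
[socket_A]
label = "one"
product = "P801B"
code = "1111110000"
[socket_B]
label = "two"
product = "2272"
code = "1111110000"
"""


def load_devices(text):
    """Parse the INI-style device config into a dict keyed by section name."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    devices = {}
    for section in parser.sections():
        # configparser keeps the surrounding double quotes, so drop them.
        devices[section] = {
            key: value.strip('"') for key, value in parser[section].items()
        }
    return devices


if __name__ == "__main__":
    for name, device in load_devices(SAMPLE).items():
        print(name, device["label"], device["product"], device["code"])
```

Each section becomes one device with its label, product type and code as plain strings.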

One of the tasks I’m working on right now is state sharing. While the list of configured devices and their states (on/off) is already shared among all XML-RPC clients (so having switched a unit in one client, others will fetch the changed state next time they poll/refresh), the state is not yet shared between several processes using librfm12.

The issue is: every process linked against librfm12 creates its own list of devices, including states – so every process has its own copy. Changes done in one process are not shared with the others. This could be solved using IPC (System V shared memory, sockets every instance connects to, etc.).

That’s it – feedback and/or participation is highly appreciated!


Ben NanoNote having an rfm12 module attached via microSD-port and several types of radio controlled power sockets

Source Code is still available on github: https://github.com/mirko/rfm12-ASK-for-linux 🙂
