The Planetary Society’s LightSail won’t stay in orbit long once its sail deploys, a victim of inexorable atmospheric drag. But we’re all lucky that in un-deployed form — as a CubeSat — LightSail can maintain its orbit for about six months. Some of that extended period may be necessary given the problem the spacecraft has encountered: After returning a healthy stream of data packets over its first two days of operations, the solar sail mission has fallen silent.
Jason Davis continues his reporting on LightSail, with the latest update on the communications problem now online. We learn that the suspected culprit for LightSail’s silence is a simple software glitch. Everything else looked good when communications ceased, with power and temperature readings stable. Davis explains that during normal operations, LightSail transmits a telemetry beacon every 15 seconds. The Linux-based flight software writes data on each transmission to a .csv file, a spreadsheet-like record of ongoing procedures.
This file continues to grow, and when it reaches a certain size, trouble can happen:
As more beacons are transmitted, the file grows in size. When it reaches 32 megabytes—roughly the size of ten compressed music files—it can crash the flight system. The manufacturer of the avionics board corrected this glitch in later software revisions. But alas, LightSail’s software version doesn’t include the update.
Late Friday, the team received a heads-up warning them of the vulnerability. A fix was quickly devised to prevent the spacecraft from crashing, and it was scheduled to be uploaded during the next ground station pass. But before that happened, LightSail fell silent. The last data packet received from the spacecraft was May 22 at 21:31 UTC (5:31 p.m. EDT).
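A rough back-of-the-envelope check, not a figure from Davis's report: if the crash did coincide with the file reaching 32 megabytes after roughly two days of 15-second beacons, each beacon would have added a few kilobytes to the log. The Python sketch below works that out. The two-day time to failure, the binary definition of a megabyte, and the function names are all assumptions made for illustration.

```python
# Back-of-the-envelope estimate. Only the 15 s beacon interval and the
# 32 MB limit come from the report; everything else is an assumption.

BEACON_INTERVAL_S = 15            # from the mission report
LIMIT_BYTES = 32 * 1024 * 1024    # 32 MB, assuming 1 MB = 2**20 bytes


def beacons_sent(elapsed_s, interval_s=BEACON_INTERVAL_S):
    """Number of beacons transmitted in elapsed_s seconds."""
    return elapsed_s // interval_s


def implied_bytes_per_beacon(elapsed_s, limit_bytes=LIMIT_BYTES):
    """Average bytes logged per beacon if the file hit the limit at elapsed_s."""
    return limit_bytes / beacons_sent(elapsed_s)


def hours_until_limit(bytes_per_beacon, limit_bytes=LIMIT_BYTES,
                      interval_s=BEACON_INTERVAL_S):
    """Hours of continuous beaconing before the log reaches limit_bytes."""
    return limit_bytes / bytes_per_beacon * interval_s / 3600


two_days_s = 2 * 24 * 3600                       # assumed time to failure
per_beacon = implied_bytes_per_beacon(two_days_s)
print(f"{beacons_sent(two_days_s)} beacons, about {per_beacon:.0f} bytes each")
```

If the real per-beacon record is smaller than this estimate, the file would take proportionally longer to reach the limit, which `hours_until_limit` lets you explore.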
Let’s hope we’ll still see a deployed LightSail, as in the image above. But anyone who has stared at a PC frozen into immobility knows the feeling that LightSail’s ground controllers must have experienced. The machine is not responding, which means it’s time for a reboot. A manual reboot being out of the question, a reboot command from the ground has to be used, and more than one has been sent. In fact, Cal Poly has been transmitting a new reboot command every few ground station passes. So far, no luck.
A fix may still be in the works from a natural source, but first, the situation led to a bit of humor, in the form of an email Davis received, as recorded in this tweet:
Davis also suggests a LightSail successor to be called BourbonSat, a flight spare that sits in each team member’s kitchen to offer quick stress relief. The humor is edgy but that’s because we may now be reliant on a hands-off fix: Charged particles striking an electronic component in just the right way to cause a reboot. If that sounds extreme, be aware that the phenomenon is not unusual in CubeSats. In fact, Cal Poly’s experience says that most reboot within the first three weeks of operations. You can place this in the context of the 28-day sail deployment timeline and see we might come out just fine.
What happens next depends upon when — and if — that reboot occurs, assuming the continued reboot commands from the ground are not effective. Various software fixes are being tested to see which could be inserted after contact is restored, so that the troublesome .csv file doesn’t cause further problems. Davis also says that when LightSail comes back online, the team will probably begin a manual sail deployment as soon as possible. Let’s make sure, in other words, that when we have a communicating spacecraft, we do what we sent it out there to do.
Why in the world would they not send up two CubeSats (or more) in the first place? Wasn’t that the whole point of a CubeSat to begin with (cheap, can make redundant, etc.)? This just kinda seems like a mega derp tbqh.
Software glitch? Sounds more like a design flaw. Since as long as I can remember, even the cheapest embedded processors have had a hardware feature called a watchdog timer. Properly functioning software periodically resets the timer. If the software ever goes off into the weeds and fails to reset the timer, the timer times out and reboots the system. Waiting for a cosmic ray? That’s no way to do fail safe design. Why wasn’t a watchdog timer used on this system?
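The watchdog pattern described above can be sketched in a few lines of Python. This is only a software simulation for illustration: a real watchdog is a hardware timer that runs independently of the processor it guards, and nothing here reflects how LightSail's avionics board is actually built. The class and method names (`Watchdog`, `kick`, `poll`) are hypothetical.

```python
import time


class Watchdog:
    """Software stand-in for a hardware watchdog timer (illustrative only)."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_expire = on_expire      # e.g. a function that reboots the system
        self.clock = clock              # injectable so the sketch can be tested
        self.deadline = clock() + timeout_s

    def kick(self):
        """Called periodically by healthy software to push the deadline back."""
        self.deadline = self.clock() + self.timeout_s

    def poll(self):
        """Fire on_expire if the deadline has passed; return True if it fired."""
        if self.clock() >= self.deadline:
            self.on_expire()
            self.kick()                 # start timing the freshly restarted system
            return True
        return False
```

In this sketch the main loop would call `kick()` once per healthy iteration, while something outside that loop calls `poll()`. If the software hangs and stops kicking, the next `poll()` after the timeout triggers the reboot callback.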
Yup. Inferior-quality software. It’s a real shame, after so much effort put in on the rest of the system – not to mention the hard-earned cash of the people who paid for it.
Mark and Andrew, I agree with you.
They should have used a timer to reset the system. In addition, log files growing too big is a pretty common problem. A well-designed system would trim the log files.
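One common way to keep a log from growing without bound is to rotate it once it reaches a size cap. The following is a minimal Python sketch of that idea, not the fix the LightSail team or the board manufacturer used. The file name, row format, and 50-byte cap are made up to keep the demo small.

```python
import os
import tempfile


def append_with_rotation(path, row, max_bytes):
    """Append one row; rotate the file first if it would exceed max_bytes."""
    line = (row.rstrip("\n") + "\n").encode("utf-8")
    size = os.path.getsize(path) if os.path.exists(path) else 0
    if size > 0 and size + len(line) > max_bytes:
        # Keep exactly one older generation; any previous ".1" is overwritten.
        os.replace(path, path + ".1")
    with open(path, "ab") as f:
        f.write(line)


def demo():
    """Write 12 ten-byte rows under a 50-byte cap; return (current, previous) sizes."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "beacon.csv")     # hypothetical file name
        for i in range(12):
            append_with_rotation(path, f"{i:04d},OK,x", max_bytes=50)
        return os.path.getsize(path), os.path.getsize(path + ".1")
```

With this scheme the total disk use stays below roughly twice the cap, at the cost of discarding the oldest records. Python's standard library offers the same behavior for log messages through `logging.handlers.RotatingFileHandler`.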
We should keep in mind that this lightsail is a test flight. Now they are well aware of some problems to look out for in the production flight. Hopefully someone is taking notes.
I am confident they will have more interesting problems in the real flight.
This incident especially caught my eye. I’m a faculty member at Vermont Technical College currently working with a small group of students writing CubeSat software. We successfully launched our first CubeSat in November of 2013, and are now working on a general purpose software infrastructure for future CubeSat missions.
We are using SPARK, a dialect of Ada that allows for the formal verification of programs. We aim to produce software that we can prove mathematically is “uncrashable.” With SPARK this is a realistic goal.
http://www.cubesatlab.org/
http://www.spark-2014.org/
If Lightsail-1 cannot be rebooted, is there any critical information needed that could delay their second launch next year, or is this a shakedown that identifies potential failures only? In this case it looks like the software failed and they may not be able to test the deployment.
@Peter Chapin
Good to see that some groups are using Ada to develop more robust code. The extra advantages of using Spark are worthwhile too. How easy has it been to get/find good SW engineers who know the language?
Good luck to them, hope nature co-operates with an ionised particle hitting the right part of the craft to cause a reboot!
…Shades of Telstar-1’s problem from the Project Starfish nuclear detonation radiation, which shortened its life; the engineers had to periodically “rest” the transistors and resort to “notched zeroes” to send commands to the satellite. Also:
Perhaps LightSail-A’s mission controllers won’t have to wait for a fortuitous Cosmic Ray. If the LightSail-A spacecraft were illuminated with microwaves from a Deep Space Network station (or from a radar astronomy dish), that might induce enough of an electrical charge in the spacecraft “bus” to have the same effect as a Cosmic Ray. (Had the original Cosmos-1 solar sail succeeded, there was a plan to test microwave sailing with it, by “beaming” it from the Goldstone DSN station; maybe that plan could “snatch their fat out of the fire” now?)
Unfortunately, we learn NOTHING from this incident! Watchdog timers and log file management are old old news in computer systems. Nobody needed to pay launch costs or lose a mission to understand this failure mode.
It makes one wonder what other foreseeable design flaws exist in this device.
I have been working in software development and IT for twenty years now. It has been increasingly frustrating to watch people repeat the same basic mistakes over and over again.
Mark you are right, there is nothing new in these kinds of errors we are discussing. Unfortunately these fundamental lessons need to be taught over and over again.
I was surprised to read Peter Chapin’s comments about teaching Ada and verifiable programs. Uncrashable software is a good idea, but would it have really helped here? The defects are more system design errors rather than software defects.
I have my doubts. Provable programs are dependent on specifications for their behaviour. The proofs are about what you intended, not that what you intended was correct.
Apparently a reboot occurred and some packets have been received.
https://twitter.com/jasonrdavis/status/604785976456548352
@Matt: “Uncrashable software is a good idea, but would it have really helped here? The defects are more system design errors rather than software defects.”
Since the problem in this case caused a crash, the verification process could have potentially exposed the issue. Correcting it could then have pointed the way to the fundamental design flaw. The developer reasons like this: “Why is a crash possible here? It’s possible because I might overflow such-and-such a value. How could that value overflow? Because I’m not managing my log files properly.”
@Mark S: “Provable programs are dependent on specifications for their behaviour. The proofs are about what you intended, not that what you intended was correct.”
There is no doubt that proving freedom from runtime error is only one step in a larger process. Ideally formal analysis could also be used to prove that the program implements the design and that the design implements the requirements. That is a tall order, and even this leaves open the problem of incomplete or incorrect requirements.
In most cases formal methods can’t, at this time, reasonably be used for all stages of the software development life cycle (although the technology is constantly improving). Thus, using formal methods will not generate a program that is absolutely guaranteed to be correct. They only serve to increase confidence in a program’s correctness. However the sad reality is that many programs exist that exhibit errors that could have been easily avoided using currently feasible formal methods. What is needed is software engineers who understand how to use the technology available now.
Although this is great news, does anyone else find it worrying that cosmic rays can cause reboots? I mean, what if we want this thing to go to Jupiter? It also points out a problem with radiation effects on small electronic packages, especially nanotech.
I don’t doubt that formal methods are a good idea. I wish we could see more of that. I despair somewhat in a world where “devops” and sloppy design practices are often considered a virtue… :(
I don’t “worry” about cosmic rays. The fact of cosmic rays causing computers to experience errors is also old news. You design your systems so that they recover properly from incorrect operation caused by cosmic rays. If the computer reboots, the flight software just has to recover and continue whatever it was doing or call home for instructions. Reboots are not the only issue, but many other errors can also be recovered from by forcing a reboot when they are detected.
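The "recover and continue" idea can be sketched as a checkpoint that the software saves as it works and reloads after any reboot. This Python example is an assumption-laden illustration, not LightSail's flight code. The state fields and phase names (`await_ground`, `sail_deploy`) are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a reboot mid-write cannot leave a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())        # push the bytes to storage before renaming
    os.replace(tmp, path)           # the rename is the atomic commit step


def load_checkpoint(path, default):
    """Return the last saved state, or default if none exists or it is unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default


def demo():
    """Simulate a first boot, a saved checkpoint, and a boot after a reset."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.json")
        safe_default = {"phase": "await_ground"}   # "call home for instructions"
        first_boot = load_checkpoint(path, safe_default)
        save_checkpoint(path, {"phase": "sail_deploy", "step": 3})
        after_reboot = load_checkpoint(path, safe_default)
        return first_boot, after_reboot
```

Falling back to a safe default when the checkpoint is missing or corrupt corresponds to the "call home for instructions" case, while a valid checkpoint lets the software pick up where it left off.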
The fact that they used a Linux board for a flight computer suggests they were more concerned about low cost than high reliability.
I keep thinking of Dr. Frankenstein shouting “It’s alive!” from the famous 1931 film about a certain science project put together from various parts then jolted into reality by outside natural forces:
http://www.planetary.org/blogs/jason-davis/2015/20150530-lightsail-phones-home.html
The sail may be deployed on June 2, aka tomorrow:
http://www.planetary.org/blogs/jason-davis/2015/20150531-lightsail-possible-tuesday-deploy.html
According to the Planetary Society’s LightSail page (http://www.planetary.org/blogs/jason-davis/2015/20150531-lightsail-possible-tuesday-deploy.html), they have decided to reboot the spacecraft’s OS once a day to work around problems with the log file growing too big.
This is a very low standard of engineering. However this is a low cost test flight, so it may be adequate. Remember the purpose of this flight is to verify that they can deploy the sail. I assume this is also a learning opportunity for many of the people involved.