Currently the drive PC runs on old computer hardware under the PSOS operating system. This hardware would be
difficult to replace if it broke, and cannot be easily upgraded to improve performance. The aim of porting
the drive PC code to Linux is to allow the use of more general computer hardware that can be replaced or
upgraded more easily in the future if required.
The primary constraint with the porting of the drive PC code is that it must behave exactly as the old code
does from the point of view of a client connecting to it. That is, it is undesirable to need to reprogram each
client program (such as
bruce, the FS antenna control etc.) to make it usable with the new drive PC code.
The secondary constraint is that we want the drive PC to be completely free of non-standard hardware that would
be required to control the telescope. In fact, we would like the drive PC to be “in contact” with as little
telescope hardware as possible. Instead, we would prefer to rely on some other less accessible piece of standalone
hardware, whose internal processes are very simple, to communicate with the hardware based on what the drive PC
instructs it to do, and to communicate back to the drive PC what state the hardware is in.
Main differences between old and new drive PC codes
- Where the old drive PC required a
root task to manage the startup process, the new drive PC uses the standard Linux startup scripts, and in fact has a standard Linux base. Currently, the new drive PC is based on Arch Linux.
- The old drive PC uses semaphores to communicate between processes, while the new drive PC uses shared memory.
- The old drive PC has a network socket listener that is always running waiting for a connection to a client process. The new drive PC uses
xinetd to spawn monitor and control processes that then no longer have to deal explicitly with network communication.
- The old drive PC could only manage a few monitoring connections at a time. This is in part because every client required the
hartbeat process to service the request, and
hartbeat was required to run completely within the 10 milliseconds between events. Because the workspace and system parameter structures are stored in shared memory in the new drive PC, it is possible to delegate the servicing of monitoring connections to the
ant_mon processes. This should also allow the PC to accept far more monitoring connections.
The old drive PCs had a hardware event generator that generated an interrupt some number of times per second.
This kept the cadence of the processing regular and allowed the drive PC to keep quite accurate time. The drive
PC’s event generator was set to generate interrupts 100 times per second.
The new drive PCs will have no trouble keeping time accurately, as NTP will do that for us. We still want a
regular cadence for the drive PC functions though, and it would be preferable to keep the 100 Hz cycle time.
So the new drive PC has a software event generator, which is called
event_generator. This program sets up an
internal alarm using the
struct itimerval timing structure and the
setitimer alarm function in GNU C.
This function takes the
itimerval structure, which has two elements, and asks for a
SIGALRM signal to be
delivered to it after a time
it_value (which can be specified in microseconds). After the
SIGALRM signal is
generated, the timer gets set automatically to the time
it_interval, which for us is set to 1/100 s = 10,000
microseconds. Thus the cadence is always kept constant, as no code is required for our routine to reset the timer.
Before all this however, the event generator also sets up a small area of shared memory. The shared memory segment
has a key of
100, and is 8 integer values in size (or 32 bytes). Element 0 of this segment is used by other
processes that want to get events from the event generator. These other processes should attach to the shared
memory, and check that element 0 is set to 0. If it is, then they may insert their PID into this shared memory
element, and the event generator will then add this PID to its list of processes that it sends events to. The
“event” is actually a SIGALRM signal sent by the event generator.
- Element 1 is set when event_generator starts to the frequency in Hz of the events that the event generator will supply. Although no effort is made to ensure this value is read-only, it should be treated as such by other processes.
- Element 2 stores the PID of the perform process, which is described later.
- Element 3 stores the PID of the hartbeat process, which is described later.
- Element 4 is set when event_generator starts to the DUT1 correction; as for element 1, it should be treated as read-only.
- Element 5 stores the PID of the sysjob process, which is described later.
- Element 6 is used by the hartbeat process to communicate to sysjob what error condition has occurred.
- Element 7 is set by sysjob to 1 when it is taking emergency action, and should be set to 0 at all other times.
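A client process wanting events would follow the registration protocol described above roughly as in this sketch, which assumes the standard System V shared memory calls; the constant names are hypothetical.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

/* Hypothetical names for the values described in the text. */
#define EVGEN_SHM_KEY  100   /* key of the event generator's segment */
#define EVGEN_SHM_INTS 8     /* 8 ints, i.e. 32 bytes with 4-byte ints */

/* Attach to the event generator's segment and ask for events by
 * writing our PID into element 0, which must first be 0 (free).
 * Returns the attached segment, or NULL on failure. */
static volatile int *evgen_register(void)
{
    volatile int *shm;
    int id = shmget(EVGEN_SHM_KEY, EVGEN_SHM_INTS * sizeof(int), 0666);

    if (id < 0)
        return NULL;                  /* event generator not running */
    shm = (volatile int *)shmat(id, NULL, 0);
    if (shm == (volatile int *)-1)
        return NULL;
    if (shm[0] != 0) {                /* another registration pending:  */
        shmdt((const void *)shm);     /* a real client would retry on   */
        return NULL;                  /* the next cycle                 */
    }
    shm[0] = (int)getpid();           /* event generator picks this up */
    return shm;
}
```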
Every time the timer produces a
SIGALRM signal, the event generator looks at shared memory element 0 to see
if another process has asked to be included, and adds its PID to the list if one has.
After this, the event generator goes through the list and sends a
SIGALRM signal to each process it knows
about. It should be noted therefore that although the event generator will always be triggered at a frequency
of 100 Hz, the time between successive alarm signals at the other processes is not guaranteed to be 10 milliseconds.
If the alarm signal cannot be delivered, because the process no longer exists or the event generator does not
have the proper authority to send signals to it, then the event generator will remove the PID from its list.
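The delivery-and-cleanup step can be sketched as below. The function is hypothetical; the event generator would pass SIGALRM, and the signal number is a parameter only so the sketch is easy to exercise with signal 0 (a delivery check that sends nothing).

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Send `signo` to every PID in `list` (length *n).  Any PID that can no
 * longer be signalled -- the process has exited (ESRCH) or we lack the
 * authority (EPERM) -- is removed from the list in place. */
static void dispatch_events(pid_t *list, int *n, int signo)
{
    int i = 0;

    while (i < *n) {
        if (kill(list[i], signo) == 0 || (errno != ESRCH && errno != EPERM)) {
            i++;                      /* delivered: keep this PID */
        } else {
            list[i] = list[*n - 1];   /* drop: overwrite with last entry */
            (*n)--;
        }
    }
}
```

Note that the removal swaps the last entry into the vacated slot, so the list order is not preserved; the event generator does not care about delivery order here.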
The heartbeat process is run in a program called
hartbeat, a leftover from the PSOS days when programs could only
have 8 character file names. The terms heartbeat and
hartbeat can and will be used interchangeably in this document.
The heartbeat process is responsible for:
- calculating where the telescope should be at any particular moment, including applying pointing corrections
- determining the rate the drives should be moving at
- checking to see that no problems have occurred in the telescope’s systems and that observing conditions are not dangerous
- ensuring that the telescope does not move past its operational limits
The hartbeat task starts up first after the event generator and sets up four areas of shared memory. The observatory
code uses two structures to keep a track of what state the telescope is in (
struct wrkspace) and what the telescope is capable of (
struct syspar). It was decided in the Linux version of the drive PC code that these two structures should be
kept in shared memory so that other processes could learn what state the telescope was in without having to deal directly with
hartbeat. This was mainly driven by a desire to make monitoring and controlling the telescope
less complicated than it was with the PSOS drive PC; this will be discussed in more detail later.
The wrkspace structure is given the shared memory key 101, while the
syspar structure has key 102.
The Linux drive PC also uses shared memory to hold the drive action semaphore and for completion messaging, with key
103. This is a six-int-sized memory area, where elements 0, 1 and 2 are for message passing, element 4 holds the
PID of the process waiting for the message, and element 5 is the semaphore address.
The fourth shared memory segment (with key NEXTJOB_SHM_KEY) is required now that different tasks can all access the
wrkspace structure in
shared memory. This will be discussed in detail in the monitoring section below.
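Segment creation at startup might look like this minimal sketch. The keys come from the text, but the structure contents are placeholders and the helper function is hypothetical; NEXTJOB_SHM_KEY is left symbolic, as in the text.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Keys from the text; the structure contents are placeholders. */
#define WRKSPACE_SHM_KEY 101
#define SYSPAR_SHM_KEY   102
#define MSG_SHM_KEY      103

struct wrkspace { int placeholder; };   /* real contents elided */
struct syspar   { int placeholder; };

/* Create (or find) a segment for `key` and attach it; NULL on failure. */
static void *segment_attach(key_t key, size_t size)
{
    void *p;
    int id = shmget(key, size, IPC_CREAT | 0666);

    if (id < 0)
        return NULL;
    p = shmat(id, NULL, 0);
    return p == (void *)-1 ? NULL : p;
}

/* hartbeat would then set up its segments along these lines:
 *   struct wrkspace *gw  = segment_attach(WRKSPACE_SHM_KEY, sizeof *gw);
 *   struct syspar   *gs  = segment_attach(SYSPAR_SHM_KEY,   sizeof *gs);
 *   int             *msg = segment_attach(MSG_SHM_KEY, 6 * sizeof(int));
 * (the fourth segment uses NEXTJOB_SHM_KEY, whose value is defined
 *  elsewhere). */
```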
hartbeat reads in the configuration file
/obs/linux/cfg/host.cfg, which should contain all the
information about what the telescope is capable of in terms of limits, drive type (XY or AzEl), speeds etc. The use
of this config file makes it easy to use the exact same drive PC code for many telescopes. At this point, the pointing
correction files (specified in
host.cfg) are read in as well. The hardware state is initialised and then queried to
see what state the antenna is in, and the time is set. As an interesting note,
hartbeat counts every time it gets
an event from the event generator, which should be 100 times per second. It is thus possible to determine how long the
hartbeat task has been running by looking at this number (
hartbeat_count), which should increment 8,640,000 times in one day.
SIGALRM is set up to trigger the
hartbeat routine, and then the process waits for its events to arrive.
The hartbeat routine does the following:
- queries the hardware (not yet implemented 2008/12/05 JBS)
- increments the hartbeat_count counter
- gets the time from the PC's internal clock to millisecond accuracy
- updates the wrkspace structure with the current hardware state (drives, focus, limits, panic buttons, temperatures etc.)
- calculates the ephemeris and sidereal times
- updates weather information
- reads the encoders, and determines which wrap to be in (for an AzEl antenna)
- checks encoder values for consistency to detect glitches, and calls sysjob if a glitch is detected
- checks the drives for problems, and calls sysjob if problems are detected
- checks the wind speed to see if we should be wind stowed, and calls sysjob if a wind stow is warranted
- applies pointing corrections
- converts telescope native coordinates to all other coordinate systems
- calls the heartbeat_drive_action routine to control the drives (described further below)
- applies acceleration limits
- calculates what we should be sending to the drives and ensures it is valid
- tells the drives to move at the calculated rate
- updates the focus platform
- generates the flashing lights on the debug box
- tells perform that we have completed our tasks in time
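The per-event flow above can be summarised in a skeleton like the following; each stub stands in for one of the listed steps, and all the names here are illustrative rather than the actual hartbeat source.

```c
#include <signal.h>

static long hartbeat_count = 0;

/* Stubs standing in for the steps listed above. */
static void query_hardware(void)    { /* not yet implemented */ }
static void update_wrkspace(void)   { /* hardware state, times, weather */ }
static void run_safety_checks(void) { /* encoders, drives, wind -> sysjob */ }
static void drive_action(void)      { /* heartbeat_drive_action() */ }
static void command_drives(void)    { /* rates, focus, debug lights */ }

/* SIGALRM handler: one pass of the hartbeat routine per event. */
static void hartbeat_tick(int sig)
{
    (void)sig;
    query_hardware();
    hartbeat_count++;        /* 100 per second, 8,640,000 per day */
    update_wrkspace();
    run_safety_checks();
    drive_action();
    command_drives();
}
```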
The heartbeat_drive_action routine (in source file
driver.c) does the following:
- check if someone else has control of the drives using the drive action semaphore shared memory location, and exit immediately if this is the case
- calculate the stopping distance in native coordinates
- update the target position and rate depending on what the telescope is supposed to be doing (i.e. tracking, slewing, scanning etc.)
- calculate the velocity of the target position in native coordinates, and check that the target position and stopping positions are inside the telescope limits, aborting the operation if they are not
- if one action is complete, move on to the action required afterwards (e.g. after a slew, we must begin tracking)
The routines that interact with the hardware have not yet been written for the new drive PC,
and the hardware itself is not yet finalised. The current plan is for the drive PC to communicate
with a Rabbit PIC and an Allen-Bradley PLC. The Rabbit will use its serial ports to control one each
of the azimuth/X and elevation/Y drives, and to read the serial position encoders. This will require
at least 4 serial ports. The Rabbit will receive the rate that the drives should be running at from the
drive PC, and it will pass this on to the drives in the appropriate fashion. It will also poll the
encoders and immediately pass on the values to the drive PC.
The PLC will manage the activity of the slower systems, such as the limit switches, the panic buttons,
weather and wind information. This is because the PLC runs at a much slower cadence than the Rabbit and
drive PC do, and these states change less often. The PLC will send data to the drive PC, but the drive
PC should not have to send much, if any, data to the PLC.
The system job process, or
sysjob, is responsible for taking control of the antenna in an emergency situation
and moving it to a safe location.
When the sysjob task starts, it attaches to the event generator’s shared memory and requests that it receive
alarm signals from it. From then on, it wakes up at 100 Hz and checks shared memory (element 6) for an emergency
condition. This condition is set by
hartbeat when the wind gets too high (
ANTENNA_WIND_STOW), after an
encoder fault (
ANTENNA_ENCODER_PROBLEM) or a drives fault (
ANTENNA_DRIVES_PROBLEM). If there is no emergency condition,
sysjob goes back to sleep waiting for another alarm signal from the event generator. If there is an
emergency, sysjob sets element 7 of shared memory to 1 to signal
hartbeat that it is handling the emergency.
If there has been an encoder or drives fault, then
sysjob issues an abort command and waits for the antenna to
stop moving, after which it commands the drives to turn off. If the wind is too high, then
sysjob commands the
antenna to park. Once the antenna is in a safe condition (off or parked),
sysjob sets element 7 of shared
memory to 0 to signal
hartbeat that it is finished handling the emergency.
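One 100 Hz pass of sysjob over the event generator's segment might look like this sketch. Only the condition names come from the text; their numeric values, the action enum, and the clearing of element 6 after handling are assumptions, and the actual drive commands are elided.

```c
/* Hypothetical numeric values for the named conditions. */
enum {
    ANTENNA_OK              = 0,
    ANTENNA_WIND_STOW       = 1,
    ANTENNA_ENCODER_PROBLEM = 2,
    ANTENNA_DRIVES_PROBLEM  = 3
};

enum sysjob_action { ACTION_NONE, ACTION_PARK, ACTION_ABORT_AND_OFF };

/* Element 6 carries the error condition set by hartbeat; element 7 is
 * held at 1 while sysjob is acting and returned to 0 when done. */
static enum sysjob_action sysjob_check(volatile int *shm)
{
    enum sysjob_action act;

    if (shm[6] == ANTENNA_OK)
        return ACTION_NONE;          /* back to sleep until the next event */

    shm[7] = 1;                      /* tell hartbeat we are handling it */
    if (shm[6] == ANTENNA_WIND_STOW)
        act = ACTION_PARK;           /* command the antenna to park */
    else
        act = ACTION_ABORT_AND_OFF;  /* abort, wait for stop, drives off */
    /* ... issue the commands and wait for a safe state here ... */
    shm[6] = ANTENNA_OK;             /* clear the condition (assumed) */
    shm[7] = 0;                      /* finished handling the emergency */
    return act;
}
```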
Antenna Monitor and Control
The majority of software differences between PSOS and Linux versions of the drive PC are in the monitoring and control code.
Under PSOS, the drive PC code started its own socket server and listened for clients, while under Linux we have opted
to use xinetd to listen for clients on port 30384. When a client connects on this port on the drive PC, xinetd spawns an
ant_mon process, which then communicates with the network socket as if it were reading and writing to standard input and output.
The ant_mon process first checks whether the client wants to control the telescope or get monitoring information,
and switches to either the ant_mon or the ant_ctrl routine accordingly.
But whereas in the PSOS code the
hartbeat routine spent time each loop “servicing” each monitoring request
individually, the Linux code leaves this job to the
ant_mon routine, which can now access the telescope state
data through the
wrkspace structure in shared memory. Due to the relatively slow speed at which the PSOS
machines were running, it was decided that a maximum of 5 clients could get their monitoring requests serviced at
any particular time. With the Linux code, since the computer that will run the drive PC will be a great deal
faster, and since the
ant_mon processes will each run in their own thread, it should be possible to greatly
increase the maximum number of clients, and perhaps even remove the limit entirely.
When the ant_mon routine is called, it checks to see what the client is asking for, which should be one of:
- ANT_SETUPLIST: the client wants to set up a new list, so it sends some info about which lists it wants (wrkspace lists, syspar lists, or a combination)
- ANT_STARTMON: the client wants the drive PC to start collecting the list data
- ANT_REQUESTLIST: the client wants to collect the data collected by the drive PC
- ANT_RESETLIST: stop collecting data
- ANT_KILLLIST: the client doesn’t care about the list any more
- ANT_GETDIAL: the client wants to know the native dial coordinates of the given demanded position
Once the antenna monitor routine has received a request to begin collecting list data, it asks to begin
receiving alarm signals from the event generator so it can begin running at the same cadence as the
hartbeat task. This allows clients to get data at up to 100 Hz.
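The request handling can be pictured as a simple dispatch over the request codes. Only the request names come from the text; their numeric values and the dispatch function are hypothetical, and the real ant_mon routine would call a handler rather than return a description.

```c
/* Hypothetical numeric values: only the names come from the text. */
enum ant_request {
    ANT_SETUPLIST, ANT_STARTMON, ANT_REQUESTLIST,
    ANT_RESETLIST, ANT_KILLLIST, ANT_GETDIAL
};

/* Map a client request to a short description of the action taken. */
static const char *ant_mon_dispatch(enum ant_request req)
{
    switch (req) {
    case ANT_SETUPLIST:   return "set up a new monitoring list";
    case ANT_STARTMON:    return "start collecting list data";
    case ANT_REQUESTLIST: return "send the collected data to the client";
    case ANT_RESETLIST:   return "stop collecting data";
    case ANT_KILLLIST:    return "discard the list";
    case ANT_GETDIAL:     return "convert the demanded position to dial coordinates";
    }
    return "unknown request";
}
```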
When the ant_ctrl task is called, it first checks where the user is connecting from. If the user is connecting
from the local host (IP
127.0.0.1), then it is the
SYSTEM_USER and can override any other telescope user.
This is the case for the
SYSJOB process, so it can always turn the telescope off or park it in case of an emergency.
All other users get
REMOTE_USER privileges. This means that, so long as no other user with greater control is using
the telescope, the user can give commands to the drive system.
The ant_ctrl routine connects to the client and listens for commands. When it gets one, it collects the data
sent to it and then calls the
driver routine to start the telescope going on what the user has asked it to do.
After this, the
hartbeat routine takes care of moving the telescope to its destination.
The control program needs to signal its controlling client when the telescope has completed its task. When the
hartbeat routine was in charge of all the control, and the
ant_ctrl routine merely signalled hartbeat
to move the telescope, this task was easy. It is not much more difficult with the Linux drive PC, using some
shared memory in the
msg variable. This variable uses element 5 (
msg) to pass the PID of the
process controlling the telescope to the
driver routine which stores it as
gw->qid, which used to be the
queue ID from
ant_ctrl. After storing the PID,
ant_ctrl waits until it sees
msg has been set to the negative of its PID, which is what
driver does when the antenna has completed
its task. At this point
ant_ctrl sends back the antenna completion code to its client.
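The hand-shake through the shared msg element reduces to the two operations sketched here; the function names are illustrative, and in the real code ant_ctrl would poll at the event cadence rather than in a tight loop.

```c
/* driver side: on completion, negate the stored controlling PID so the
 * waiting ant_ctrl can see its task is done. */
static void driver_signal_done(volatile int *slot)
{
    if (*slot > 0)
        *slot = -*slot;
}

/* ant_ctrl side: check this on each event; when it becomes true, send
 * the completion code back to the client. */
static int action_is_complete(volatile const int *slot, int pid)
{
    return *slot == -pid;
}
```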
The only special case is the
ABORT command, which causes the
ant_ctrl task that was controlling the telescope
to immediately stop waiting and send back an aborted completion code, and causes the
ant_ctrl task giving the abort
command to wait until the telescope has reached an idle state.
The drive PC is usually a silent beast that has no visible on-screen output as all its tasks are run in the background
as services. In addition, the
ant_ctrl routines communicate with their clients using
xinetd, so having messages output from these programs can cause the network communication to break
down if care is not taken.
Each routine is therefore able to access the system log through some helper macros defined in
These macros call a routine that outputs to the system log. For the Linux drive PC, this log is found in
Organisation and Compilation
The new Linux drive PC code is part of the
/obs directory structure, as many of the libraries in it are required
by the drive PC. The new code is in
/obs/linux and consists of the source files
sysjob.c. It also requires the header files
During normal compilation with the flag
-Wall (report all warnings) many warning messages were emitted, mostly about
defined variables not being used. To suppress these warnings, and to make it clearer when a genuine compilation error
does occur, each source file with warnings was given a routine called
nowarnings_hartbeat (for example) which used
these variables and thus stopped the warnings. These routines are not called from any of the code however.
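The pattern looks like the following sketch; the variable and routine here are hypothetical examples of the technique, not the actual code.

```c
/* A file-scope variable that -Wall would otherwise flag as unused. */
static int spare_calibration[4];

/* Never called from the real code: it exists only to reference the
 * unused variables and so silence the compiler warnings. */
static void nowarnings_example(void)
{
    (void)spare_calibration;
}
```

With GCC, an alternative is to mark such variables `__attribute__((unused))`, which avoids the dummy routine altogether.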
The drive PC
The computer running the drive PC does not need to be anything special, and indeed it is not. The computer that will be used for
the first Linux drive PC has:
- Motherboard: Gigabyte GA-EG31M-S2
- Processor: Intel Celeron Dual Core E1400
- RAM: 1 GB Kingston - KVR800D2N5/1G
- HDD: 80 GB Western Digital WD800AAJS
- Optical: Lite-On - DH-16D3P
- PSU: 380W Antec - EA-380
- Case: Antec - NSK4000
This computer costs only ~$650 including GST. The main benefit that this computer has over the old computers running PSOS is the
hard drives, which should allow the drive PC to boot up cold in only a few tens of seconds (a virtual PC used for developing the
drive PC software was able to boot up into the system in 34 seconds).
The drive PC uses Arch Linux, which is a lightweight distribution that uses few resources but is also highly configurable
and is easy to maintain.
The drive PC services need to be started in a particular order. The startup scripts for the services are in
and the order that they are started up is configured on the last line of
The startup scripts, and their order are given below:
- DRIVEPC_1_event_generator: this needs to be started first to establish the shared memory segments, and to start generating the alarm signals that the rest of the system needs to run properly.
- DRIVEPC_2_hartbeat: the heartbeat is started up next to take control of the telescope’s systems.
- xinetd: needs to be started to listen for incoming client connections; it will also start ant_mon processes as required.
- DRIVEPC_3_sysjob: start the emergency control client after the client listener is started; this should usually be the first client to connect, although it doesn’t need to be.
- DRIVEPC_4_perform: start the performance monitor last to ensure everything runs reliably.
The startup scripts can be called manually as per normal Debian-type startup scripts, i.e. with
restart as their argument.
Maintenance and troubleshooting