Linux Drive PC Code
Currently the drive PC runs on old computer hardware under the PSOS operating system. This hardware would be difficult to replace if it broke, and cannot easily be upgraded to improve performance. The aim of porting the drive PC code to Linux is to allow the use of more generic computer hardware that can be replaced or upgraded more easily in the future if required.
The primary constraint with the porting of the drive PC code is that it must behave exactly as the old code does from the point of view of a client connecting to it. That is, it is undesirable to need to reprogram each client program (such as vdesk, bruce, the FS antenna control etc.) to make it usable with the new system.
The secondary constraint is that we want the drive PC to be completely free of non-standard hardware that would be required to control the telescope. In fact, we would like the drive PC to be "in contact" with as little telescope hardware as possible. Instead, we would prefer to rely on some other less accessible piece of standalone hardware, whose internal processes are very simple, to communicate with the hardware based on what the drive PC instructs it to do, and to communicate back to the drive PC what state the hardware is in.
Main differences between old and new drive PC codes
- Where the old drive PC required a root task to manage the startup process, the new drive PC uses the standard Linux startup scripts, and in fact has a standard Linux base. Currently, the new drive PC is based on Arch Linux.
- The old drive PC uses semaphores to communicate between processes, while the new drive PC uses shared memory.
- The old drive PC has a network socket listener that is always running, waiting for a connection from a client process. The new drive PC uses xinetd to spawn monitor and control processes that then no longer have to deal explicitly with network communication.
- The old drive PC could only manage a few monitoring connections at a time. This is partly because every client required the hartbeat process to service the request, and hartbeat was required to run completely within the 10 milliseconds between events. Because the workspace and system parameter structures are stored in shared memory in the new drive PC, it is possible to delegate the servicing of monitoring connections to the ant_mon processes. This should also allow the PC to accept far more monitoring connections.
Event Generator
The old drive PCs had a hardware event generator that generated an interrupt some number of times per second. This kept the cadence of the processing regular and allowed the drive PC to keep quite accurate time. The drive PC's event generator was set to generate interrupts 100 times per second.
The new drive PCs will have no trouble keeping time accurately, as NTP will do that for us. We still want a regular cadence for the drive PC functions though, and it would be preferable to keep the 100 Hz cycle time. So the new drive PC has a software event generator, which is called event_generator. This program sets up an internal alarm using the struct itimerval timing structure and the setitimer alarm function in GNU C. This function takes the itimerval structure, which has two elements, and asks for a SIGALRM signal to be delivered after a time it_value (which can be specified in microseconds). After the SIGALRM signal is generated, the timer is automatically reloaded with the time it_interval, which for us is set to 1/100 s = 10,000 microseconds. Thus the cadence is always kept constant, as no code is required for our routine to reset the timer.
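As a rough sketch (not the actual event_generator source; on_alarm and ticks are illustrative names), the timer setup looks something like this:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks = 0;

static void on_alarm(int sig)
{
    (void)sig;
    ticks++;              /* one tick per 10 ms event */
}

int main(void)
{
    struct itimerval tv;

    signal(SIGALRM, on_alarm);

    /* First expiry after 10,000 us; the timer then reloads itself
     * from it_interval, so the 100 Hz cadence needs no reset code. */
    tv.it_value.tv_sec     = 0;
    tv.it_value.tv_usec    = 10000;
    tv.it_interval.tv_sec  = 0;
    tv.it_interval.tv_usec = 10000;

    if (setitimer(ITIMER_REAL, &tv, NULL) == -1) {
        perror("setitimer");
        return 1;
    }

    for (;;)
        pause();          /* sleep until the next SIGALRM */
}
```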
Shared memory
Before all this, however, the event generator also sets up a small area of shared memory. The shared memory segment has a key of 100, and is 8 integer values (32 bytes) in size. Element 0 of this segment is used by other processes that want to get events from the event generator. These processes should attach to the shared memory and check that element 0 is set to 0. If it is, they may insert their PID into this element, and the event generator will then add this PID to its list of processes that it sends events to. The "event" is actually a SIGALRM signal.
- Element 1 is set when event_generator starts to the frequency in Hz of the events that the event generator will supply. Although no effort is made to ensure this value is read-only, it should be treated as such by other processes.
- Element 2 stores the PID of the perform process, which is described later.
- Element 3 stores the PID of the hartbeat process, which is described later.
- Element 4 is set when event_generator starts to the DUT1 correction, and as for element 1, it should be considered read-only.
- Element 5 stores the PID of the sysjob process, which is described later.
- Element 6 is used by the hartbeat process to communicate to sysjob what error condition has occurred.
- Element 7 is set by sysjob to 1 when it is taking emergency action, and should be set to 0 at all other times.
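A client process that wants events might register along these lines. This is a sketch against the layout above; note that nothing here guards against two processes writing element 0 at the same instant:

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

#define EVENT_SHM_KEY 100

int register_for_events(void)
{
    /* Attach to the existing 8-int segment created by event_generator. */
    int shmid = shmget(EVENT_SHM_KEY, 8 * sizeof(int), 0666);
    if (shmid == -1) {
        perror("shmget");
        return -1;
    }

    int *shm = shmat(shmid, NULL, 0);
    if (shm == (int *)-1) {
        perror("shmat");
        return -1;
    }

    /* Element 0 must read 0 before we may claim it with our PID. */
    if (shm[0] != 0) {
        shmdt(shm);
        return -1;          /* another process is mid-registration */
    }
    shm[0] = (int)getpid();

    printf("registered for events at %d Hz\n", shm[1]);
    shmdt(shm);
    return 0;
}
```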
Generating events
Every time the timer produces a SIGALRM signal, the event generator looks at shared memory element 0 to see if another process has asked to be included, and adds its PID to the list if one has.
After this, the event generator goes through the list and sends a SIGALRM signal to each process it knows about. It should be noted, therefore, that although the event generator will always be triggered at a frequency of 100 Hz, the time between successive alarm signals at the other processes is not guaranteed to be 10 milliseconds.
If the alarm signal cannot be delivered, because the process no longer exists or the event generator does not have the proper authority to send signals to it, the event generator will remove the PID from its list with no warning.
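In sketch form, the delivery loop might look as follows. kill() fails with ESRCH when the process is gone and EPERM when we lack permission, and in either case the PID is dropped; the pid_list array and npids counter are illustrative names:

```c
#include <errno.h>
#include <signal.h>

#define MAX_CLIENTS 64

static pid_t pid_list[MAX_CLIENTS];
static int   npids = 0;

static void deliver_events(void)
{
    int i = 0;

    while (i < npids) {
        if (kill(pid_list[i], SIGALRM) == -1) {
            /* Process has exited (ESRCH) or is not ours to signal
             * (EPERM): remove it from the list with no warning. */
            pid_list[i] = pid_list[--npids];
        } else {
            i++;
        }
    }
}
```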
Heartbeat
The heartbeat process is run in a program called hartbeat, a leftover from the PSOS days when programs could only have 8-character file names. The terms heartbeat and hartbeat can and will be used interchangeably in this document.
The heartbeat process is responsible for:
- calculating where the telescope should be at any particular moment, including applying pointing corrections
- determining the rate the drives should be moving at
- checking to see that no problems have occurred in the telescope's systems and that observing conditions are not dangerous
- ensuring that the telescope does not move past its operational limits
The hartbeat task starts up first after the event generator and sets up four areas of shared memory. The observatory code uses two structures to keep track of what state the telescope is in (struct wrkspace) and what the telescope is capable of (struct syspar). It was decided that in the Linux version of the drive PC code these two structures should be kept in shared memory, so that other processes could learn what state the telescope was in without having to communicate directly with hartbeat. This was mainly driven by a desire to make monitoring and controlling the telescope less complicated than it was with the PSOS drive PC; this will be discussed in more detail later.
The wrkspace structure is given the shared memory key 101, while the syspar structure has key 102.
The Linux drive PC also uses shared memory to hold the drive action semaphore and for completion messaging, with key 103. This is a 6-int (24-byte) memory area, where elements 0, 1 and 2 are for message passing, element 4 holds the PID of the process waiting for the message, and element 5 is the semaphore address.
The fourth shared memory segment (with key NEXTJOB_SHM_KEY) is required now that different tasks can all access the wrkspace structure in shared memory. This will be discussed in detail in the monitoring section below.
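A sketch of how hartbeat might create these segments follows. The structure bodies here are placeholders (the real definitions come from the observatory headers), and NEXTJOB_SHM_KEY is left symbolic:

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Placeholder bodies; the real structures come from the observatory
 * headers. */
struct wrkspace { int state;  /* ... */ };
struct syspar   { int limits; /* ... */ };

static struct wrkspace *ws;
static struct syspar   *sp;
static int             *msg;   /* 6 ints: messages, waiter PID, semaphore */

static int setup_shared_memory(void)
{
    int ws_id  = shmget(101, sizeof(struct wrkspace), IPC_CREAT | 0666);
    int sp_id  = shmget(102, sizeof(struct syspar),   IPC_CREAT | 0666);
    int msg_id = shmget(103, 6 * sizeof(int),         IPC_CREAT | 0666);
    /* The fourth segment, with key NEXTJOB_SHM_KEY, is created the
     * same way. */

    if (ws_id == -1 || sp_id == -1 || msg_id == -1) {
        perror("shmget");
        return -1;
    }

    ws  = shmat(ws_id,  NULL, 0);   /* shmat error checks omitted */
    sp  = shmat(sp_id,  NULL, 0);
    msg = shmat(msg_id, NULL, 0);
    return 0;
}
```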
Upon starting, hartbeat reads in the configuration file /obs/linux/cfg/host.cfg, which should contain all the information about what the telescope is capable of in terms of limits, drive type (XY or AzEl), speeds etc. The use of this config file makes it easy to use the exact same drive PC code for many telescopes. At this point, the pointing correction files (specified in host.cfg) are read in as well. The hardware state is initialised and then queried to see what state the antenna is in, and the time is set. As an interesting note, hartbeat counts every time it gets an event from the event generator, which should be 100 times per second. It is thus possible to determine how long the hartbeat task has been running by looking at this number (hartbeat_count), which should increment 8,640,000 times in one day.
The signal SIGALRM is set up to trigger the hartbeat routine, and then the process waits for its events to arrive.
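The receiving side might be set up along these lines (an illustrative sketch, not the actual hartbeat source):

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile unsigned long hartbeat_count = 0;

/* Runs once per event: 100 times per second, 8,640,000 per day. */
static void hartbeat_routine(int sig)
{
    (void)sig;
    hartbeat_count++;
    /* ... the per-event work listed below ... */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = hartbeat_routine;
    sigaction(SIGALRM, &sa, NULL);

    for (;;)
        pause();          /* wait for events from event_generator */
}
```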
The hartbeat routine does the following:
- queries the hardware (not yet implemented 2008/12/05 JBS)
- increments the hartbeat_count
- gets the time from the PC's internal clock to millisecond accuracy
- updates the wrkspace structure with the current hardware state (drives, focus, limits, panic buttons, temperatures etc.)
- calculates the ephemeris and sidereal times
- updates weather information
- reads the encoders, and determines which wrap to be in (for an AzEl antenna)
- checks encoder values for consistency to detect glitches, and calls sysjob if a glitch is detected
- checks the drives for problems, and calls sysjob if problems are detected
- checks the wind speed to see if we should be wind stowed, and calls sysjob if a wind stow is warranted
- applies pointing corrections
- converts telescope native coordinates to all other coordinate systems
- calls the heartbeat_drive_action routine to control the drives (described further below)
- applies acceleration limits
- calculates what we should be sending to the drives and ensures it is valid
- tells the drives to move at the calculated rate
- updates the focus platform
- generates the flashing lights on the debug box
- signals perform that we have completed our tasks in time
The heartbeat_drive_action routine (in source file driver.c) does the following:
- checks if someone else has control of the drives using the drive action semaphore shared memory location, and exits immediately if this is the case
- calculates the stopping distance in native coordinates
- updates the target position and rate depending on what the telescope is supposed to be doing (i.e. tracking, slewing, scanning etc.)
- calculates the velocity of the target position in native coordinates, and checks that the target position and stopping positions are inside the telescope limits, aborting the operation if they are not
- if one action is complete, moves on to the action required afterwards (e.g. after a slew, we must begin tracking)
The routines that interact with the hardware have not yet been written for the new drive PC, and the hardware itself is not yet finalised. The current plan is for the drive PC to communicate with a Rabbit microcontroller and an Allen-Bradley PLC. The Rabbit will use its serial ports to control one each of the azimuth/X and elevation/Y drives, and to read the serial position encoders. This will require at least 4 serial ports. The Rabbit will receive the rate that the drives should be running at from the drive PC, and it will pass this on to the drives in the appropriate fashion. It will also poll the encoders and immediately pass on the values to the drive PC.
The PLC will manage the activity of the slower systems, such as the limit switches, the panic buttons, weather and wind information. This is because the PLC runs at a much slower cadence than the Rabbit and drive PC do, and these states change less often. The PLC will send data to the drive PC, but the drive PC should not have to send much, if any, data to the PLC.
System Job
The system job process, or sysjob, is responsible for taking control of the antenna in an emergency situation and moving it to a safe location.
After the sysjob task starts, it attaches to the event generator's shared memory and requests that it receive alarm signals from it. From then on, it wakes up at 100 Hz and checks shared memory (element 6) for an emergency condition. This condition is set by hartbeat when the wind gets too high (ANTENNA_WIND_STOW), after an encoder fault (ANTENNA_ENCODER_PROBLEM) or after a drives fault (ANTENNA_DRIVES_PROBLEM). If there is no emergency condition, sysjob goes back to sleep waiting for another alarm signal from the event generator. If there is an emergency condition, sysjob sets element 7 of shared memory to 1 to signal hartbeat that it is handling the emergency.
If there has been an encoder or drives fault, sysjob issues an abort command and waits for the antenna to stop moving, after which it commands the drives to turn off. If the wind is too high, sysjob commands the antenna to park. Once the antenna is in a safe condition (off or parked), sysjob sets element 7 of shared memory to 0 to signal hartbeat that it is finished handling the emergency.
Antenna Monitor and Control
The majority of software differences between the PSOS and Linux versions of the drive PC are in the monitoring and control code. Under PSOS, the drive PC code started its own socket server and listened for clients, while under Linux we have opted to use xinetd to listen for clients on port 30384. When a client connects on this port on the drive PC, xinetd starts an ant_mon process, which then communicates with the network socket as if it were reading and writing to STDIN and STDOUT respectively.
The ant_mon process first checks whether the client wants to control the telescope or get monitoring information, and switches to the routine ant_ctrl or ant_mon respectively.
Whereas in the PSOS code the hartbeat routine spent time each loop "servicing" each monitoring request individually, the Linux code leaves this job to the ant_mon routine, which can now access the telescope state data through the wrkspace structure in shared memory. Due to the relatively slow speed at which the PSOS machines were running, it was decided that a maximum of 5 clients could have their monitoring requests serviced at any particular time. With the Linux code, since the computer that will run the drive PC will be a great deal faster, and since each ant_mon instance runs as its own process, it should be possible to greatly increase the maximum number of clients, and perhaps even remove the limit entirely.
Monitoring
When the ant_mon routine is called it checks to see what the client is asking for, which should be one of:
- ANT_SETUPLIST: the client wants to set up a new list, so it sends some info about which lists it wants (from the wrkspace or syspar lists, or a combination)
- ANT_STARTMON: the client wants the drive PC to start collecting the list data
- ANT_REQUESTLIST: the client wants to collect the data collected by the drive PC
- ANT_RESETLIST: stop collecting data
- ANT_KILLLIST: the client doesn't care about the list any more
- ANT_GETDIAL: the client wants to know the native dial coordinates of the given demanded position
Once the antenna monitor routine has received a request to begin collecting list data, it asks to begin receiving alarm signals from the event generator so it can begin running at the same cadence as the hartbeat task. This allows clients to get data at up to 100 Hz.
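The dispatch over these requests might be sketched as below. The command names come from the list above, but their values and the commented helper steps are illustrative; the real wire format is defined by the existing clients:

```c
/* Illustrative command codes; the real values live in the drive PC
 * headers shared with the clients. */
enum ant_request {
    ANT_SETUPLIST, ANT_STARTMON, ANT_REQUESTLIST,
    ANT_RESETLIST, ANT_KILLLIST, ANT_GETDIAL
};

void ant_mon_handle(enum ant_request req)
{
    switch (req) {
    case ANT_SETUPLIST:   /* record which wrkspace/syspar items to collect */ break;
    case ANT_STARTMON:    /* register with the event generator, start sampling */ break;
    case ANT_REQUESTLIST: /* write the collected data back to the client */ break;
    case ANT_RESETLIST:   /* stop collecting, keep the list definition */ break;
    case ANT_KILLLIST:    /* discard the list entirely */ break;
    case ANT_GETDIAL:     /* convert the demanded position to dial coordinates */ break;
    }
}
```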
Control
When the ant_ctrl task is called it first checks where the user is connecting from. If the user is connecting from the local host (IP 127.0.0.1), then it is the SYSTEM_USER and can override any other telescope user. This is the case for the sysjob process, so it can always turn the telescope off or park it in case of an emergency.
All other users get REMOTE_USER privileges. This means that, so long as no other user with greater control is using the telescope, they can give commands to the drive system.
The ant_ctrl routine connects to the client and listens for commands. When it gets one, it collects the data sent to it and then calls the driver routine to start the telescope going on what the user has asked it to do. After this, the hartbeat routine takes care of moving the telescope to its destination.
The control program needs to signal its controlling client when the telescope has completed its task. When the hartbeat routine was in charge of all the control, and the ant_ctrl routine merely signalled hartbeat to move the telescope, this task was easy. It is not much more difficult with the Linux drive PC, using some shared memory in the msg variable. This variable uses element msg[4] to pass the PID of the ant_mon process controlling the telescope to the driver routine, which stores it as gw->qid (the field that used to hold the queue ID from ant_ctrl). After storing the PID, driver blanks msg[4], and ant_ctrl waits until it sees that msg[4] has been set to the negative of its PID, which is what driver does when the antenna has completed its task. At this point ant_ctrl sends back the antenna completion code to its client.
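The handshake through the key-103 segment can be sketched as follows; gw->qid is elided, the function names are illustrative, and the polling interval is simply matched to the 10 ms event cadence:

```c
#include <unistd.h>

extern volatile int *msg;          /* attached to segment key 103 */

/* driver: note who is waiting, then blank the slot */
static pid_t take_waiter(void)
{
    pid_t waiter = msg[4];         /* PID of the controlling process */
    msg[4] = 0;
    return waiter;                 /* stored as gw->qid in the source */
}

/* driver, once the antenna has completed its task */
static void signal_done(pid_t waiter)
{
    msg[4] = -waiter;              /* negative PID marks completion */
}

/* ant_ctrl: announce our PID, then wait for our negative PID back */
static void wait_for_completion(void)
{
    msg[4] = getpid();
    while (msg[4] != -getpid())
        usleep(10000);             /* poll at the 10 ms event cadence */
}
```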
The only special case is the ABORT command, which causes the ant_ctrl task that was controlling the telescope to immediately stop waiting and send back an aborted completion code, and causes the ant_ctrl task giving the abort command to wait until the telescope has reached an idle state.
Performance Monitoring
Message Logging
The drive PC is usually a silent beast that has no visible on-screen output, as all its tasks run in the background as services. There are also the ant_mon and ant_ctrl routines, which communicate with their clients using STDIN and STDOUT via xinetd, so having messages printed from these programs can cause the network communication to break down if care is not taken.
Each routine is therefore able to access the system log through some helper macros defined in /obs/generic/sysmsg/root_msg.h. These macros call a routine that outputs to the system log. For the Linux drive PC, this log is found in /var/log/everything.log.
Organisation and Compilation
The new Linux drive PC code is part of the /obs directory structure, as many of the libraries in it are required by the drive PC. The new code is in /obs/linux and consists of the source files ant_ctrl.c, ant_mon.c, coords.c, display.c, driver.c, drives.c, encoder.c, event_generator.c, focus.c, hardware.c, hartbeat.c, heartmon.c, init_ant.c, newday.c, nextjob.c, packmon.c, perform.c, pointing.c, sys_cfg.c and sysjob.c. It also requires the header files ant_cmd.h, coords.h, driver.h, drives.h, focus.h, newday.h, nextjob.h, packmon.h, pointing.h and sys_cfg.h.
The Makefile compiles event_generator, perform, sysjob, hartbeat and ant_mon.
During normal compilation with the flag -Wall (report all warnings), many warning messages were emitted, mostly about defined variables not being used. To suppress these warnings, and so make it clearer when a genuine compilation error occurs, each source file with warnings was given a routine (called, for example, nowarnings_hartbeat) that uses these variables and thus silences the warnings. These routines are never called from any of the code, however.
The drive PC
Computer specifications
The computer running the drive PC does not need to be anything special, and indeed it is not. The computer that will be used for the first Linux drive PC has:
- Motherboard: Gigabyte GA-EG31M-S2
- Processor: Intel Celeron Dual Core E1400
- RAM: 1 GB Kingston - KVR800D2N5/1G
- HDD: 80 GB Western Digital WD800AAJS
- Optical: Lite-On - DH-16D3P
- PSU: 380W Antec - EA-380
- Case: Antec - NSK4000
This computer costs only ~$650 including GST. The main benefit that this computer has over the old computers running PSOS is the hard drive, which should allow the drive PC to boot up cold in only a few tens of seconds (a virtual PC used for developing the drive PC software was able to boot up into the system in 34 seconds).
Linux
The drive PC uses Arch Linux, which is a lightweight distribution that uses few resources but is also highly configurable and is easy to maintain.
Startup sequence
The drive PC services need to be started in a particular order. The startup scripts for the services are in /etc/rc.d, and the order in which they are started is configured on the last line of /etc/rc.conf.
The startup scripts, in their startup order, are given below (a sample rc.conf line follows the list):
- DRIVEPC_1_event_generator: this needs to be started first to establish the shared memory segments, and to start generating the alarm signals that the rest of the system needs to run properly.
- DRIVEPC_2_hartbeat: the heartbeat is started up next to take control of the telescope's systems.
- xinetd: needs to be started to listen for incoming client connections; it will also start ant_mon processes as required.
- DRIVEPC_3_sysjob: start the emergency control client after the client listener is started; this should usually be the first client to connect, although it doesn't need to be.
- DRIVEPC_4_perform: start the performance monitor last to ensure everything runs reliably.
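For illustration, the DAEMONS array on the last line of /etc/rc.conf might then read (the entries before the drive PC services are assumed to be the usual Arch Linux defaults):

```
# Last line of /etc/rc.conf: services start left to right
DAEMONS=(syslog-ng network DRIVEPC_1_event_generator DRIVEPC_2_hartbeat xinetd DRIVEPC_3_sysjob DRIVEPC_4_perform)
```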
The startup scripts can be called manually as per normal Debian-style startup scripts, i.e. with start, stop, or restart as their argument.