Linux Drive PC Code

Currently the drive PC runs on old computer hardware under the PSOS operating system. This hardware would be difficult to replace if it broke, and cannot easily be upgraded to improve performance. The aim of porting the drive PC code to Linux is to allow the use of more general computer hardware that can be replaced or upgraded more easily in the future if required.

The primary constraint with the porting of the drive PC code is that it must behave exactly as the old code does from the point of view of a client connecting to it. That is, it is undesirable to need to reprogram each client program (such as vdesk, bruce, the FS antenna control etc.) to make it usable with the new system.

The secondary constraint is that we want the drive PC to be completely free of non-standard hardware that would be required to control the telescope. In fact, we would like the drive PC to be "in contact" with as little telescope hardware as possible. Instead, we would prefer to rely on some other less accessible piece of standalone hardware, whose internal processes are very simple, to communicate with the hardware based on what the drive PC instructs it to do, and to communicate back to the drive PC what state the hardware is in.

Main differences between old and new drive PC codes

  • Where the old drive PC required a root task to manage the startup process, the new drive PC uses the standard Linux startup scripts, and in fact has a standard Linux base. Currently, the new drive PC is based on Arch Linux.
  • The old drive PC uses semaphores to communicate between processes, while the new drive PC uses shared memory.
  • The old drive PC has a network socket listener that is always running waiting for a connection to a client process. The new drive PC uses xinetd to spawn monitor and control processes that then no longer have to deal explicitly with network communication.
  • The old drive PC could only manage a few monitoring connections at a time. This in part is because every client required the hartbeat process to service the request, and hartbeat was required to run completely within the 10 milliseconds between events. Because the workspace and system parameter structures are stored in shared memory in the new drive PC, it is possible to delegate the servicing of monitoring connections to the ant_mon processes. This should also allow the PC to accept far more monitoring connections.

Event Generator

The old drive PCs had a hardware event generator that generated an interrupt some number of times per second. This kept the cadence of the processing regular and allowed the drive PC to keep quite accurate time. The drive PC's event generator was set to generate interrupts 100 times per second.

The new drive PCs will have no trouble keeping time accurately, as NTP will do that for us. We still want a regular cadence for the drive PC functions though, and it would be preferable to keep the 100 Hz cycle time. So the new drive PC has a software event generator, called event_generator. This program sets up an internal alarm using the struct itimerval timing structure and the setitimer function provided by the GNU C library. setitimer takes the itimerval structure, which has two members, and requests that a SIGALRM signal be delivered after the time it_value (which can be specified to microsecond resolution). After each SIGALRM signal is generated, the timer is automatically reloaded with the time it_interval, which for us is set to 1/100 s = 10,000 microseconds. Thus the cadence is kept constant, as no code is required in our routine to reset the timer.

Shared memory

Before all this however, the event generator also sets up a small area of shared memory. The shared memory segment has a key of 100, and is 8 integer values in size (or 32 bytes). Element 0 of this segment is used by other processes that want to get events from the event generator. These other processes should attach to the shared memory, and check that element 0 is set to 0. If it is, then they may insert their PID into this shared memory element, and the event generator will then add this PID to its list of processes that it sends events to. The "event" is actually a SIGALRM signal.

Element 1 is set when event_generator starts to be the frequency in Hz of the events that the event generator will supply. Although no effort is made to ensure this value is read-only, it should be treated as such by other processes.

Element 2 stores the PID of the perform process, which is described later.

Element 3 stores the PID of the hartbeat process, which is described later.

Element 4 is set when event_generator starts to be the DUT1 correction, and as for element 1, it should be considered read-only.

Element 5 stores the PID of the sysjob process, which is described later.

Element 6 is used by the hartbeat process to communicate to sysjob what error condition has occurred.

Element 7 is set by sysjob to be 1 when it is taking emergency action, and should be set to 0 at all other times.
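As a sketch of how a client process might use this segment, the layout can be written as an enumeration and the registration step as a small function. The identifier names below are assumptions for illustration; only the key (100), the segment size (8 ints) and the meaning of each element come from the description above. The check-then-write on element 0 is not atomic, so two processes registering at the same instant could collide; this sketch does not attempt to solve that.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

#define EVGEN_SHM_KEY   100  /* key given in the text */
#define EVGEN_SHM_ELEMS 8    /* 8 ints, 32 bytes with 4-byte int */

/* Hypothetical names for the eight elements described in the text. */
enum evgen_slot {
    EVGEN_REGISTER     = 0,  /* PID of a process asking for events, else 0 */
    EVGEN_FREQ_HZ      = 1,  /* event frequency, treat as read-only */
    EVGEN_PERFORM_PID  = 2,
    EVGEN_HARTBEAT_PID = 3,
    EVGEN_DUT1         = 4,  /* DUT1 correction, treat as read-only */
    EVGEN_SYSJOB_PID   = 5,
    EVGEN_ERROR        = 6,  /* error condition from hartbeat to sysjob */
    EVGEN_EMERGENCY    = 7   /* 1 while sysjob is taking emergency action */
};

/* Attach to the segment created by the event generator.
 * Returns NULL if the segment does not exist or cannot be attached. */
volatile int *evgen_attach(void)
{
    void *p;
    int id = shmget(EVGEN_SHM_KEY, EVGEN_SHM_ELEMS * sizeof(int), 0);

    if (id < 0)
        return NULL;
    p = shmat(id, NULL, 0);
    return (p == (void *)-1) ? NULL : (volatile int *)p;
}

/* Ask to receive events: write our PID into element 0 if it is free.
 * Returns 1 if the request was posted, 0 if the slot was busy. */
int evgen_request_events(volatile int *shm, pid_t pid)
{
    if (shm[EVGEN_REGISTER] != 0)
        return 0;  /* another request is still pending; retry later */
    shm[EVGEN_REGISTER] = (int)pid;
    return 1;
}
```

A client would typically call evgen_attach() once at startup and retry evgen_request_events() until it returns 1.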

Generating events

Every time the timer produces a SIGALRM signal, the event generator looks at shared memory element 0 to see if another process has asked to be included, and adds its PID to the list if one has.

After this, the event generator goes through the list and sends a SIGALRM signal to each process it knows about. It should be noted therefore that although the event generator will always be triggered at a frequency of 100 Hz, the time between successive alarm signals at the other processes is not guaranteed to be 10 milliseconds. If the alarm signal cannot be delivered, because the process no longer exists or the event generator does not have the authority to send signals to it, then the event generator removes the PID from its list without warning.
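The delivery step might look something like the sketch below. The list structure, its capacity and the function names are assumptions made for illustration; kill() is the standard call, and it fails with ESRCH or EPERM in exactly the two cases mentioned above. PIDs of zero or less are rejected because kill() treats them as process-group targets.

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_LISTENERS 32  /* hypothetical capacity */

struct listener_list {
    pid_t pid[MAX_LISTENERS];
    int count;
};

/* Add a PID to the list. Returns 1 if added, 0 if the PID is invalid,
 * already present, or the list is full. */
int listeners_add(struct listener_list *l, pid_t pid)
{
    int i;

    if (pid <= 0 || l->count >= MAX_LISTENERS)
        return 0;
    for (i = 0; i < l->count; i++)
        if (l->pid[i] == pid)
            return 0;
    l->pid[l->count++] = pid;
    return 1;
}

/* Send SIGALRM to every listener. Any PID that cannot be signalled is
 * dropped silently by moving the last entry into its place, so list order
 * is not preserved. Returns the number of signals delivered. */
int listeners_signal_all(struct listener_list *l)
{
    int i = 0, delivered = 0;

    while (i < l->count) {
        if (kill(l->pid[i], SIGALRM) == 0) {
            delivered++;
            i++;
        } else {
            l->pid[i] = l->pid[l->count - 1];
            l->count--;
        }
    }
    return delivered;
}
```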

Heartbeat

The heartbeat process is run in a program called hartbeat, a leftover from the PSOS days when programs could only have 8 character file names. The terms heartbeat and hartbeat can and will be used interchangeably in this document.

The heartbeat process is responsible for:

  • calculating where the telescope should be at any particular moment, including applying pointing corrections
  • determining the rate the drives should be moving at
  • checking to see that no problems have occurred in the telescope's systems and that observing conditions are not dangerous
  • ensuring that the telescope does not move past its operational limits

The hartbeat task starts up first after the event generator and sets up four areas of shared memory. The observatory code uses two structures to keep track of what state the telescope is in (struct wrkspace) and what the telescope is capable of (struct syspar). It was decided in the Linux version of the drive PC code that these two structures should be kept in shared memory so that other processes could learn what state the telescope was in without having to communicate directly with hartbeat. This was mainly driven by a desire to make monitoring and controlling the telescope less complicated than it was with the PSOS drive PC; this will be discussed in more detail later.

The wrkspace structure is given the shared memory key 101, while the syspar structure has key 102.

The Linux drive PC also uses shared memory to hold the drive action semaphore and for completion messaging, with key 103. This is a 6 int sized memory area, where elements 0, 1 and 2 are for message passing, element 4 holds the PID of the process waiting for the message and element 5 is the semaphore address.

The fourth shared memory segment (with key NEXTJOB_SHM_KEY) is required now that different tasks can all access the wrkspace structure in shared memory. This will be discussed in detail in the monitoring section below.

Upon starting, hartbeat reads in the configuration file /obs/linux/cfg/host.cfg, which should contain all the information about what the telescope is capable of in terms of limits, drive type (XY or AzEl), speeds etc. The use of this config file makes it easy to use the exact same drive PC code for many telescopes. At this point, the pointing correction files (specified in host.cfg) are read in as well. The hardware state is initialised and then queried to see what state the antenna is in, and the time is set. As an interesting note, hartbeat counts every event it gets from the event generator, which should arrive 100 times per second. It is thus possible to determine how long the hartbeat task has been running by looking at this number (hartbeat_count), which should increment 8,640,000 times in one day.
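The arithmetic for turning hartbeat_count into a running time is simple enough to show directly. The function and structure names here are assumptions for illustration, as is the use of a long for the counter; the 100 events per second figure comes from the text.

```c
#define EVENTS_PER_SECOND 100L  /* event generator frequency from the text */

struct uptime {
    long days;
    int hours;
    int minutes;
    int seconds;
};

/* Convert a count of 100 Hz events into elapsed days, hours, minutes and
 * seconds. Fractions of a second are discarded. */
struct uptime hartbeat_uptime(long hartbeat_count)
{
    long total = hartbeat_count / EVENTS_PER_SECOND;  /* whole seconds */
    struct uptime u;

    u.days    = total / 86400;
    u.hours   = (int)((total % 86400) / 3600);
    u.minutes = (int)((total % 3600) / 60);
    u.seconds = (int)(total % 60);
    return u;
}
```

With this conversion, a count of 8,640,000 corresponds to exactly one day, in agreement with the figure quoted above.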

The signal SIGALRM is set up to trigger the hartbeat routine, and then the process waits for its events to arrive.

The hartbeat routine does the following:

  • queries the hardware (not yet implemented 2008/12/05 JBS)
  • increments the hartbeat_count
  • gets the time from the PC's internal clock to millisecond accuracy
  • updates the wrkspace structure with the current hardware state (drives, focus, limits, panic buttons, temperatures etc.)
  • calculates the ephemeris and sidereal times
  • updates weather information
  • reads the encoders, and determines which wrap to be in (for AzEl antennas)
  • checks encoder values for consistency to detect glitches, and calls sysjob if a glitch is detected
  • checks the drives for problems, and calls sysjob if problems are detected
  • checks the wind speed to see if we should be wind stowed, and calls sysjob if a wind stow is warranted
  • applies pointing corrections
  • converts telescope native coordinates to all other coordinate systems
  • calls the heartbeat_drive_action routine to control the drives (described further below)
  • applies acceleration limits
  • calculates what we should be sending to the drives and ensures it is valid
  • tells the drives to move at the calculated rate
  • updates the focus platform
  • generates the flashing lights on the debug box
  • signals perform that we have completed our tasks in time

The heartbeat_drive_action routine (in source file driver.c) does the following:

  • checks whether someone else has control of the drives using the drive action semaphore shared memory location, and exits immediately if this is the case
  • calculates the stopping distance in native coordinates
  • updates the target position and rate depending on what the telescope is supposed to be doing (i.e. tracking, slewing, scanning etc.)
  • calculates the velocity of the target position in native coordinates, and checks that the target position and stopping positions are inside the telescope limits, aborting the operation if they are not
  • if one action is complete, moves on to the action required afterwards (e.g. after a slew, we must begin tracking)
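To illustrate the stopping-distance and limit check for a single axis, the sketch below assumes constant deceleration, so that the distance needed to stop from rate v with deceleration a is v²/(2a). That model, and all of the function and parameter names, are assumptions made for this example; the formula actually used in driver.c is not given here.

```c
/* Signed distance travelled while slowing from `rate` to rest at a constant
 * deceleration `decel` (decel > 0). The sign follows the direction of
 * motion. Constant deceleration is an assumption of this sketch. */
double stopping_distance(double rate, double decel)
{
    double d = (rate * rate) / (2.0 * decel);

    return (rate < 0.0) ? -d : d;
}

/* Returns 1 if an axis at position `pos` moving at `rate` could stop
 * without leaving the interval [lower, upper], and 0 otherwise. */
int stop_within_limits(double pos, double rate, double decel,
                       double lower, double upper)
{
    double stop = pos + stopping_distance(rate, decel);

    return stop >= lower && stop <= upper;
}
```

In this model, an axis at 88 degrees moving at 2 degrees per second with a deceleration of 1 degree per second squared would come to rest at 90 degrees, which is just inside a limit of 90.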

The routines that interact with the hardware have not yet been written for the new drive PC, and the hardware itself is not yet finalised. The current plan is for the drive PC to communicate with a Rabbit PIC and an Allen-Bradley PLC. The Rabbit will use its serial ports to control one each of the azimuth/X and elevation/Y drives, and to read the serial position encoders. This will require at least 4 serial ports. The Rabbit will receive the rate that the drives should be running at from the drive PC, and it will pass this on to the drives in the appropriate fashion. It will also poll the encoders and immediately pass on the values to the drive PC.

The PLC will manage the activity of the slower systems, such as the limit switches, the panic buttons, weather and wind information. This is because the PLC runs at a much slower cadence than the Rabbit and drive PC do, and these states change less often. The PLC will send data to the drive PC, but the drive PC should not have to send much, if any, data to the PLC.

System Job

The system job process, or sysjob, is responsible for taking control of the antenna in an emergency situation and moving it to a safe location.

After the sysjob task starts, it attaches to the event generator's shared memory and requests that it receive alarm signals from it. From then on, it wakes up at 100 Hz and checks shared memory (element 6) for an emergency condition. This condition is set by hartbeat when the wind gets too high (ANTENNA_WIND_STOW), after an encoder fault (ANTENNA_ENCODER_PROBLEM) or a drives fault (ANTENNA_DRIVES_PROBLEM). If there is no emergency condition, sysjob goes back to sleep waiting for another alarm signal from the event generator. If there is an emergency condition, sysjob sets element 7 of shared memory to 1 to signal hartbeat that it is handling the emergency.

If there has been an encoder or drives fault, then sysjob issues an abort command and waits for the antenna to stop moving, after which it commands the drives to turn off. If the wind is too high, then sysjob commands the antenna to park. Once the antenna is in a safe condition (off or parked), sysjob sets element 7 of shared memory to 0 to signal hartbeat that it is finished handling the emergency.
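The decision sysjob makes on each wake-up can be sketched as below. The numeric values given to the ANTENNA_ constants, the ANTENNA_OK name, the action enumeration and the function names are all placeholders for illustration; the real definitions live in the drive PC headers. The text does not say which process clears element 6, so this sketch leaves it untouched.

```c
/* Placeholder values: the real constants are defined in the drive PC code. */
enum antenna_condition {
    ANTENNA_OK = 0,  /* hypothetical name for "no emergency" */
    ANTENNA_WIND_STOW,
    ANTENNA_ENCODER_PROBLEM,
    ANTENNA_DRIVES_PROBLEM
};

/* Hypothetical labels for the responses described in the text. */
enum sysjob_action {
    SYSJOB_NONE,           /* go back to sleep */
    SYSJOB_PARK,           /* wind too high: park the antenna */
    SYSJOB_ABORT_THEN_OFF  /* fault: abort, wait for stop, drives off */
};

enum { SHM_ERROR = 6, SHM_EMERGENCY = 7 };  /* element numbers from the text */

/* Map an emergency condition to the response sysjob should take. */
enum sysjob_action sysjob_choose_action(int condition)
{
    switch (condition) {
    case ANTENNA_WIND_STOW:
        return SYSJOB_PARK;
    case ANTENNA_ENCODER_PROBLEM:
    case ANTENNA_DRIVES_PROBLEM:
        return SYSJOB_ABORT_THEN_OFF;
    default:
        return SYSJOB_NONE;
    }
}

/* One 100 Hz wake-up: read element 6 and, if action is needed, raise the
 * "handling emergency" flag in element 7 before acting. */
enum sysjob_action sysjob_poll(volatile int *shm)
{
    enum sysjob_action action = sysjob_choose_action(shm[SHM_ERROR]);

    if (action != SYSJOB_NONE)
        shm[SHM_EMERGENCY] = 1;
    return action;
}

/* Called once the antenna is off or parked. */
void sysjob_finish(volatile int *shm)
{
    shm[SHM_EMERGENCY] = 0;
}
```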

Antenna Monitor and Control

The majority of software differences between PSOS and Linux versions of the drive PC are in the monitoring and control code. Under PSOS, the drive PC code started its own socket server and listened for clients, while under Linux we have opted to use xinetd to listen for clients on port 30384. When a client connects on this port on the drive PC, xinetd starts an ant_mon process, which then communicates with the network socket as if it were reading and writing to STDIN and STDOUT respectively.
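Because xinetd hands the connected socket to the service as its standard input and output, reading a client request reduces to reading from file descriptor 0. The helper below is a sketch of one way to do that robustly; the name read_full is an assumption and is not taken from the drive PC source. It retries on EINTR, which may matter for a process that also receives periodic SIGALRM signals, since a signal handler installed without SA_RESTART can interrupt a blocking read.

```c
#include <errno.h>
#include <unistd.h>

/* Read exactly `len` bytes from `fd` unless end-of-file arrives first.
 * Returns the number of bytes read (less than len only at end-of-file),
 * or -1 on a read error other than EINTR. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t n = read(fd, p + got, len - got);

        if (n == 0)
            break;         /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted by a signal: try again */
            return -1;
        }
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```

Under xinetd a service would call this as read_full(STDIN_FILENO, buf, len) and send its reply with write() on STDOUT_FILENO.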

The ant_mon process first checks whether the client wants to control the telescope or get monitoring information, and switches to either the routine ant_ctrl or ant_mon respectively.

Whereas in the PSOS code the hartbeat routine spent time each loop "servicing" each monitoring request individually, the Linux code leaves this job to the ant_mon routine, which can now access the telescope state data through the wrkspace structure in shared memory. Due to the relatively slow speed at which the PSOS machines were running, it was decided that a maximum of 5 clients could get their monitoring requests serviced at any particular time. With the Linux code, since the computer that will run the drive PC will be a great deal faster, and since each ant_mon instance runs as its own process, it should be possible to greatly increase the maximum number of clients, and perhaps even to remove the limit altogether.

Monitoring

When the ant_mon routine is called it checks to see what the client is asking for, which should be one of:

  • ANT_SETUPLIST: the client wants to set up a new list, so it sends some info about which lists it wants (from the wrkspace or syspar lists, or a combination)

  • ANT_STARTMON: the client wants the drive PC to start collecting the list data
  • ANT_REQUESTLIST: the client wants to collect the data collected by the drive PC
  • ANT_RESETLIST: stop collecting data
  • ANT_KILLLIST: the client doesn't care about the list any more
  • ANT_GETDIAL: the client wants to know the native dial coordinates of the given demanded position

Once the antenna monitor routine has received a request to begin collecting list data, it asks to begin receiving alarm signals from the event generator so it can begin running at the same cadence as the hartbeat task. This allows clients to get data at up to 100 Hz.

Control

When the ant_ctrl task is called it first checks where the user is connecting from. If the user is connecting from the local host (IP 127.0.0.1), then it is the SYSTEM_USER and can override any other telescope user. This is the case for the sysjob process, so it can always turn the telescope off or park it in case of an emergency.

All other users get REMOTE_USER privileges. This means that, so long as no other user with greater control is using the telescope, they can give commands to the drive system.
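The privilege rule can be sketched as two small functions. The numeric values of the levels, the NO_USER level and the function names are assumptions for illustration; the text only establishes that connections from 127.0.0.1 are SYSTEM_USER, that everything else is REMOTE_USER, and that a user is blocked only by another user with greater control. How ant_ctrl actually obtains the peer address is not described here (calling getpeername() on the socket is one standard way).

```c
#include <string.h>

/* Placeholder values; higher means greater control. */
enum user_level {
    NO_USER     = 0,  /* hypothetical: nobody currently holds the telescope */
    REMOTE_USER = 1,
    SYSTEM_USER = 2
};

/* Classify a client by the dotted-quad address it connected from. */
enum user_level classify_user(const char *peer_ip)
{
    return (strcmp(peer_ip, "127.0.0.1") == 0) ? SYSTEM_USER : REMOTE_USER;
}

/* A requester may command the drives unless the current holder has
 * strictly greater control. Returns 1 if allowed, 0 if not. */
int may_command(enum user_level requester, enum user_level holder)
{
    return requester >= holder;
}
```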

The ant_ctrl routine connects to the client and listens for commands. When it gets one, it collects the data sent to it and then calls the driver routine to start the telescope going on what the user has asked it to do. After this, the hartbeat routine takes care of moving the telescope to its destination.

The control program needs to signal its controlling client when the telescope has completed its task. When the hartbeat routine was in charge of all the control, and the ant_ctrl routine merely signalled hartbeat to move the telescope, this task was easy. It is not much more difficult with the Linux drive PC, which uses some shared memory in the msg variable. Element 4 of this variable (msg[4]) passes the PID of the ant_mon process controlling the telescope to the driver routine, which stores it as gw->qid (this used to be the queue ID from ant_ctrl). After storing the PID, driver blanks msg[4], and ant_ctrl waits until it sees that msg[4] has been set to the negative of its PID, which is what driver does when the antenna has completed its task. At this point ant_ctrl sends the antenna completion code back to its client.
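The handshake on msg[4] can be written out step by step as below. The function names and the driver_state structure (a stand-in for the structure that gw points to) are assumptions for illustration, and the sketch deliberately omits how the completion code itself is carried, since that is not described here.

```c
#define MSG_ELEMS      6  /* size of the key 103 segment, from the text */
#define MSG_WAITER_PID 4  /* msg[4]: PID of the process awaiting completion */

/* Stand-in for the structure holding gw->qid. */
struct driver_state {
    int qid;
};

/* ant_ctrl side: announce which process is waiting for completion. */
void ctrl_post_request(volatile int *msg, int pid)
{
    msg[MSG_WAITER_PID] = pid;
}

/* driver side: remember the waiting PID, then blank the slot. */
void driver_accept(volatile int *msg, struct driver_state *gw)
{
    gw->qid = msg[MSG_WAITER_PID];
    msg[MSG_WAITER_PID] = 0;
}

/* driver side: the antenna has finished, so publish the negated PID. */
void driver_complete(volatile int *msg, const struct driver_state *gw)
{
    msg[MSG_WAITER_PID] = -gw->qid;
}

/* ant_ctrl side: poll for completion. Returns 1 once the negated PID of
 * this process appears in the slot. */
int ctrl_is_complete(const volatile int *msg, int pid)
{
    return msg[MSG_WAITER_PID] == -pid;
}
```

Using the negated PID means a waiting process can tell its own completion apart from both the blank slot and a request posted by another process.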

The only special case is the ABORT command, which causes the ant_ctrl task that was controlling the telescope to immediately stop waiting and send back an aborted completion code, and causes the ant_ctrl task giving the abort command to wait until the telescope has reached an idle state.

Performance Monitoring

Message Logging

The drive PC is usually a silent beast that has no visible on-screen output, as all its tasks are run in the background as services. In addition, the ant_mon and ant_ctrl routines communicate with their clients using STDIN and STDOUT via xinetd, so any message these programs write to standard output can cause the network communication to break down if care is not taken.

Each routine is therefore able to access the system log through some helper macros defined in /obs/generic/sysmsg/root_msg.h. These macros call a routine that outputs to the system log. For the Linux drive PC, this log is found in /var/log/everything.log.
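A logging helper of this kind could be built on the standard syslog() call, as in the sketch below. The macro name DRIVEPC_LOG_INFO and the function log_task_message are hypothetical and are not the macros defined in root_msg.h; the point of the example is only that messages go to the system log and never to standard output.

```c
#include <stdio.h>
#include <syslog.h>

/* Format "[task] text" and send it to the system log at the given
 * priority. Returns the length of the formatted line (before any
 * truncation to the local buffer), or -1 if formatting failed. */
int log_task_message(const char *task, int priority, const char *text)
{
    char line[256];
    int n = snprintf(line, sizeof line, "[%s] %s", task, text);

    if (n < 0)
        return -1;
    syslog(priority, "%s", line);  /* fixed format string: text is data */
    return n;
}

/* Hypothetical convenience macro in the spirit of the helper macros. */
#define DRIVEPC_LOG_INFO(task, text) \
    log_task_message((task), LOG_INFO, (text))
```

A call such as DRIVEPC_LOG_INFO("hartbeat", "started") hands the line "[hartbeat] started" to the system logger; where it ends up on disk depends on the logger configuration.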

Organisation and Compilation

The new Linux drive PC code is part of the /obs directory structure, as many of the libraries in it are required by the drive PC. The new code is in /obs/linux and consists of the source files ant_ctrl.c, ant_mon.c, coords.c, display.c, driver.c, drives.c, encoder.c, event_generator.c, focus.c, hardware.c, hartbeat.c, heartmon.c, init_ant.c, newday.c, nextjob.c, packmon.c, perform.c, pointing.c, sys_cfg.c and sysjob.c. It also requires the header files ant_cmd.h, coords.h, driver.h, drives.h, focus.h, newday.h, nextjob.h, packmon.h, pointing.h and sys_cfg.h.

The Makefile compiles event_generator, perform, sysjob, hartbeat and ant_mon.

During normal compilation with the flag -Wall (report all warnings) many warning messages were emitted, mostly about defined variables not being used. In order to prevent emission of these warnings, to make it clearer if a compilation error does occur, each source file with warnings was given a routine called nowarnings_hartbeat (for example) which used these variables and thus stopped the warnings. These routines are not called from any of the code however.

The drive PC

Computer specifications

The computer running the drive PC does not need to be anything special, and indeed it is not. The computer that will be used for the first Linux drive PC has:

  • Motherboard: Gigabyte GA-EG31M-S2
  • Processor: Intel Celeron Dual Core E1400
  • RAM: 1 GB Kingston - KVR800D2N5/1G
  • HDD: 80 GB Western Digital WD800AAJS
  • Optical: Lite-On - DH-16D3P
  • PSU: 380W Antec - EA-380
  • Case: Antec - NSK4000

This computer costs only ~$650 including GST. The main benefit that this computer has over the old computers running PSOS is its hard drive, which should allow the drive PC to boot up cold in only a few tens of seconds (a virtual PC used for developing the drive PC software was able to boot up into the system in 34 seconds).

Linux

The drive PC uses Arch Linux, which is a lightweight distribution that uses few resources but is also highly configurable and is easy to maintain.

Startup sequence

The drive PC services need to be started in a particular order. The startup scripts for the services are in /etc/rc.d, and the order that they are started up is configured on the last line of /etc/rc.conf.

The startup scripts, and their order are given below:

  1. DRIVEPC_1_event_generator: this needs to be started first to establish the shared memory segments, and to start generating the alarm signals that the rest of the system needs to run properly.
  2. DRIVEPC_2_hartbeat: the heartbeat is started up next to take control of the telescope's systems.
  3. xinetd: needs to be started to listen for incoming client connections; it will also start ant_mon processes as required
  4. DRIVEPC_3_sysjob: start the emergency control client after the client listener is started, and this should usually be the first client to connect, although it doesn't need to be
  5. DRIVEPC_4_perform: start the performance monitor last to ensure everything runs reliably

The startup scripts can be called manually as per normal Debian-type startup scripts, i.e. with start, stop, or restart as their argument.

Maintenance and troubleshooting