Latest Entries

Tx filtering reworked

I have revised my RC filter because the previous oversampling factor didn’t divide the digital clock evenly, which is BAD for continuous tracking.

I have empirically verified that things are good now and that this is better in the SNR-sense than a simple bandpass filter.

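A minor note: spotted yet another Xilinx quirk. When comparing an std_logic_vector to a constant, the number of bits has to match! (This may be the language rather than the tool: VHDL’s predefined equality treats arrays of different lengths as unequal.)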

if ( ovrsmpl_count = "01001" ) then 
    ovrsmpl_limit <= '1';
else
    ovrsmpl_limit <= '0';
end if;

Can’t remember if this is something new or yet another release-dependent peculiarity.

fixed-point issues – again

My block floating-point (BFP) formatter is now more or less there. I now need to use non-blocking reads/writes to realize a ping-pong buffering scheme that can overlap with the number crunching. For now, things have become a lot more stable in the fixed-point sense and I’m no longer using any rounding at all.

A small note: the barrel shifter can be further optimized if I hard-code the possible shifts in conjunction with the SNR I expect to see. The Xilinx XtremeDSP manual shows that an 18-bit barrel shifter takes two slices, which doubles for complex data.
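For readers unfamiliar with block floating point, here is a minimal sketch of the idea, not the actual formatter from this project. It assumes plain 32-bit integer samples and a target mantissa width passed in by the caller; the names `BfpBlock`, `signed_bits` and `bfp_format` are made up for illustration. The whole block shares one shift (the job of the barrel shifter), and samples are truncated rather than rounded.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One formatted block: all mantissas share a single exponent (shift).
struct BfpBlock {
    std::vector<int32_t> mantissa; // samples scaled to fit the target width
    int shift;                     // right shift applied to every sample
};

// Conservative bit count for a signed value: magnitude bits plus a sign bit.
static int signed_bits(int32_t x) {
    int64_t mag = x < 0 ? -static_cast<int64_t>(x) : x;
    int bits = 1; // sign bit
    while (mag != 0) {
        ++bits;
        mag >>= 1;
    }
    return bits;
}

// Scale a block so that its largest sample fits in `width` bits.
BfpBlock bfp_format(const std::vector<int32_t>& block, int width) {
    assert(width >= 2 && width <= 32);
    int needed = 1;
    for (int32_t x : block) {
        int b = signed_bits(x);
        if (b > needed) needed = b;
    }
    BfpBlock out;
    out.shift = needed > width ? needed - width : 0;
    out.mantissa.reserve(block.size());
    for (int32_t x : block) {
        // Right shift of a negative value is arithmetic on mainstream
        // compilers (guaranteed since C++20): truncation, no rounding.
        out.mantissa.push_back(x >> out.shift);
    }
    return out;
}
```

Calling `bfp_format({1000, -300, 70000}, 10)` yields a shared shift of 8, because the largest sample needs 18 bits. Hard-coding the possible shifts, as suggested above, would amount to restricting `shift` to a small set of values.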

fixed-point issues

Having manually merged the channel processing loops for better HLS, I am now tying up the loose ends of the fixed-point algorithm. Channel and chip estimation and despreading now all work properly. The last revision I made in this respect is the following:

acqParams.E0 = ACC;
E0_reciprocal_type num0, num1;
//! reinterpret channel power
num0(17, 0) = acqParams.E0(17, 0);
//! reinterpret fractional ONE
num1(17, 0) = ( ( 1 << 17 ) - 1 );
acqParams.E0_reciprocal = num1/num0;

As can be seen above, I get the accumulated power into an 18-bit fractional value. This is then reinterpreted into a similar length unsigned integer. The power reciprocal is then obtained from dividing the two’s complement positive one by the reinterpreted fractional value.
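The same arithmetic can be sketched with plain integers instead of the HLS fixed-point types. This is only an illustration under stated assumptions: the 18-bit value is taken to have 17 fractional bits, and the output precision `out_frac_bits` is a hypothetical parameter that the snippet above leaves to the `E0_reciprocal_type` definition.

```cpp
#include <cassert>
#include <cstdint>

// Assumed format: 18-bit two's-complement fraction with 17 fractional bits,
// so a raw value r represents r / 2^17.
constexpr int kFracBits = 17;
// Largest positive raw value, i.e. the "fractional ONE" (just below 1.0).
constexpr uint32_t kOneRaw = (1u << kFracBits) - 1;

// Reciprocal of the power as a raw integer with `out_frac_bits` fractional
// bits: floor(kOneRaw * 2^out_frac_bits / e0_raw).
uint64_t reciprocal_raw(uint32_t e0_raw, int out_frac_bits) {
    assert(e0_raw != 0 && e0_raw <= kOneRaw);
    assert(out_frac_bits >= 0 && out_frac_bits <= 32);
    return (static_cast<uint64_t>(kOneRaw) << out_frac_bits) / e0_raw;
}

// Helper for inspection only: convert a raw value back to a real number.
double raw_to_double(uint64_t raw, int frac_bits) {
    return static_cast<double>(raw) / static_cast<double>(1ull << frac_bits);
}
```

With a power of 0.5 (raw value `1u << 16`) and 8 output fractional bits, the result is 511, i.e. about 1.996 rather than 2.0. That small deficit comes from using 2^17 - 1 as "one" plus truncation, which is the kind of error the acquisition test has to absorb.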

In the chip estimate function I do:

//! scale by channel power
Y = ACC;
Y.real() *= acqParams.E0_reciprocal;
Y.imag() *= acqParams.E0_reciprocal;
acqParams.q_hat[cdma.cyc_kTc] = Y;

Also for despreading, I’m no longer multiplying:

acqParams.d_hat.real() = ACC.real() >> 9;
acqParams.d_hat.imag() = ACC.imag() >> 9;

Remember:
----------
Whatever error is introduced throughout the various optimizations has to be accounted for in your acquisition test.

third HLS pass

Here is the latest revision of the algorithm with slight modifications to previous high-level directives.

[Figure: HLS 3rd pass]

Comments:

  • v_bar_code_loop is now left rolled. Only the chip nested loop is fully unrolled.
  • the chip oversampled rate now appears in literally all directives involving loops and memory partitioning.
  • the last loop, despreading_loop, is cheap enough to fully unroll, so it makes sense to use this loop to buy myself more time if needed. Still, for cosmetic reasons, I opted to partially unroll it by the same pervasive (term is used with extreme caution) factor that occurs throughout all directives: the chip oversampling rate.

With the revised set of directives, I’m now almost 200 clock cycles away from the adaptation deadline. Area has gone down very slightly since I’m no longer partially unrolling v_bar_code_loop.

HLS – PAR’ed

Here is my first fully place-and-routed algorithm; undebugged.

[Figure: HLS impl]

Comments:

  • I reverted to unified ‘chip throttling.’ As such, all memories are BRAMs partitioned cyclically by the chip oversampling rate, and all loops are also unrolled by the chip oversampling rate for maximum data parallelism.
  • Memory access bottlenecks, and subsequently pipeline stalls, on circular signal vectors are tolerated as opposed to going fully partitioned. This is mostly because of the tremendous generic fabric cost, not to mention the very long logic delays on these rather deep vectors.
  • In order to make up for these bottlenecks, two more loops have been partially unrolled by a factor of 2, namely v_bar_code_loop and despreading_loop. Both the Gold code and the code estimate sequences have also been partitioned by a factor of two to sustain the data rate in these loops. The rationale is that these are relatively cheap to do given the binary nature of the code, and that only an adder/subtracter has to be duplicated per loop, albeit for the oversampled chip (4x) in v_bar_code_loop.
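The pairing of cyclic partitioning with an equal unroll factor can be pictured as a plain index mapping. The sketch below is an illustration only, not tool output: it assumes one access per bank per cycle (real BRAMs are typically dual-port), and the names `cyclic_map` and `conflict_free` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Where element `index` of the original array lands after partitioning.
struct BankAddress {
    std::size_t bank;   // which memory bank holds the element
    std::size_t offset; // address inside that bank
};

// Cyclic partitioning: consecutive elements go to consecutive banks.
BankAddress cyclic_map(std::size_t index, std::size_t factor) {
    assert(factor > 0);
    return BankAddress{index % factor, index / factor};
}

// True when `unroll` consecutive elements starting at `base` all sit in
// different banks, so an unrolled loop body can fetch them in one cycle.
bool conflict_free(std::size_t base, std::size_t unroll, std::size_t factor) {
    assert(factor > 0);
    std::vector<bool> used(factor, false);
    for (std::size_t k = 0; k < unroll; ++k) {
        std::size_t bank = cyclic_map(base + k, factor).bank;
        if (used[bank]) return false; // two accesses would hit the same bank
        used[bank] = true;
    }
    return true;
}
```

With a factor of 4 (the oversampled chip), any 4 consecutive samples are conflict-free regardless of the starting index, which suits circular vectors. Unrolling by 4 over only 2 banks is not, and that is where stalls would appear.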

As can be seen, timing is almost met right away, but I have to trace back in static timing the path that results in the extra ~0.6 ns, which is not a big deal. The priority now is to sort out the precision first so that I don’t end up doing this more than once.

second HLS pass – undebugged

Alright, slowly optimizing away. Latest numbers including the decision directed PLL now are:

[Figure: HLS second pass]

Major enhancements:
(1) going fully partitioned for the DSSS circular data arrays

set_directive_resource -core RAM_async algo_fixed_type<D,F,C,FM>::exe cdma.ds.v.vect
set_directive_resource -core RAM_async algo_fixed_type<D,F,C,FM>::exe cdma.ds.v_bar_hat.vect

(2) combining the circular arrays in a wider global array in order to save up on control and decode logic

set_directive_array_map -instance cdma_ds_circ_vect -mode vertical algo_fixed_type<D,F,C,FM>::exe cdma.ds.v.vect
set_directive_array_map -instance cdma_ds_circ_vect -mode vertical algo_fixed_type<D,F,C,FM>::exe cdma.ds.v_bar_hat.vect

(3) streaming arrays whose indices are constant throughout

set_directive_array_stream algo_fixed_type<D,F,C,FM>::exe cdma.ds.h_hat_vect
set_directive_array_stream algo_fixed_type<D,F,C,FM>::exe cdma.ds.v0_hat_vect
set_directive_array_stream algo_fixed_type<D,F,C,FM>::exe acqParams.h_hat_mag_vect

first HLS pass – revisited

I’m now using the newest release of AutoPilot. After a bit of grief to do with an issue I’ll comment on in a minute, pressing the synthesis button on roughly the same source code resulted in:

[Figure: HLS first pass revisited]

The tool seems to have shaved quite a few FFs off the final count in comparison with the old release. Power consumption during acquisition has gone up a little bit too.

I’d initially been using nested namespaces to aggregate some algorithmic parameters shared among all functions. The new release does not like this and refuses to compile. So I had to put all static constant parameters in a structure instead and add a .cpp file as well for non-integral type assignments. The new params.cpp file has to be added to the compilation too, both in command-line simulation and in the AutoPilot GUI. I also suspect that the decreased FF count could be explained by copies of these global parameters being kept locally in LUTs, which don’t require registers. See the High-level Synthesis Blue Book for more discussion on this.

The new release also has a much more stable CDFG diagram, which is quite handy to say the least!

first HLS pass – undebugged/unoptimized

[Figure: HLS first pass]

This at least proves for the first time that real-time adaptive acquisition for airborne broadband ultrasound is feasible. Currently it can run at about 23.7 kHz, above the needed 20 kHz.

Next step is to debug the fixed-point algorithm and arrive at the needed precision for templatized data types using real datasets.

halfway HLS

Halfway through HL synthesizing, here are the latest metrics:

[Figure: HLS halfway]

I’m also halfway through the clock budget. What remains, intensive as it is, can afford quite a bit of functional pipelining.
I’m also still wrestling with this directive:

set_directive_expression_balance algo_fixed_type<D,F,C,FM>::fix_math_type::cmplx_mag<fract_T,cmplx_T,coef_T>/cmag_region

So yet more template quirks.
Note that in order for the directive to go through, you can’t use spaces between the template arguments.

Final remark:
h = fxMath.template cmplx_mag(ch);
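Note the use of .template when calling a templatized member function through the wrapper algorithmic class: the object’s type depends on a template parameter, so the compiler needs the keyword to parse the call.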
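A stripped-down illustration of why the keyword is needed is given here. The class and member names are hypothetical stand-ins rather than the project’s real types, and the magnitude is just a cheap |re| + |im| estimate so that there is something to call.

```cpp
#include <cassert>

// Hypothetical stand-in for the math helper: a class template that has a
// member function template.
template <typename CoefT>
struct FixMath {
    template <typename MagT, typename CmplxT>
    MagT cmplx_mag(const CmplxT& c) const {
        // |re| + |im| magnitude estimate, for illustration only.
        MagT re = c.re < 0 ? -c.re : c.re;
        MagT im = c.im < 0 ? -c.im : c.im;
        return re + im;
    }
};

// Minimal complex sample type for the example.
struct Cplx {
    int re;
    int im;
};

// Hypothetical wrapper algorithmic class.
template <typename CoefT>
struct Algo {
    FixMath<CoefT> fxMath;

    int run(const Cplx& ch) const {
        // The type of fxMath depends on CoefT, so while parsing this template
        // the compiler cannot know that cmplx_mag is itself a template.
        // `.template` says that the following `<` opens a template argument
        // list rather than being a less-than comparison.
        return fxMath.template cmplx_mag<int>(ch);
    }
};
```

Dropping the keyword in `run` makes a conforming compiler read `cmplx_mag < int` as a comparison and reject the line.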

RE: out-of-box experience

[Figure: area]

out-of-box experience

I have pressed the synthesis button on my algorithm and managed to get a feel for how much more lies ahead :) . Still, it is an exciting new development with interesting final challenges. Stay tuned, I will post numbers in the future.

CS Anthorn bus multiplexing

I think I’ll be using and-or multiplexing for the 8 ADCs, and hopefully this will work nicely right away.
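For reference, a software model of an and-or multiplexer might look like the sketch below. It assumes a one-hot select and 16-bit samples; the sample width and the names `kNumAdcs` and `and_or_mux` are assumptions, not taken from the actual design.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumAdcs = 8;

// AND-OR multiplexer: every input is ANDed with a mask derived from its
// select bit, and the masked inputs are ORed onto the shared bus.
uint16_t and_or_mux(const std::array<uint16_t, kNumAdcs>& adc,
                    uint8_t one_hot_sel) {
    uint16_t bus = 0;
    for (std::size_t i = 0; i < kNumAdcs; ++i) {
        // All ones when select bit i is set, all zeros otherwise.
        uint16_t mask = ((one_hot_sel >> i) & 1u) ? 0xFFFFu : 0x0000u;
        bus = static_cast<uint16_t>(bus | (adc[i] & mask));
    }
    return bus;
}
```

If the select is not one-hot, the bus carries the bitwise OR of every selected input, so the select logic has to guarantee exclusivity. The appeal of the structure is that it needs no priority encoding or tri-states.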

CS Anthorn

A nice, clean way of interfacing my CS Anthorn to my PC is the following:
- wrap up my acquisition logic as a “Black Box HDL” and carry it to SysGen
- use shared memories in co-simulation mode with some statemachines
- script everything in matlab and off you go

5v tolerant I/O on ml403

After the usual half a day overhead, I now know I should use J3, pin 32 (EXP_IIC_SDA) as a 5 volt tolerant input from the NI acquisition card. It has a level shifting transistor originally intended for external I2C cascade. Should do the trick nicely I guess.

Doppler Experiment C#

Just finished coding my Doppler experiment in C#. The structure of the program is as follows:

namespace DopplerExperiment
{

public struct Fiducial ...

public delegate void FiduMovingEventHandler ...

public class FiduMovingEventArgs : EventArgs ...

public partial class reacTIVisonEx : Form, TuioListener ...

public class AcqNI ...

}

Apart from structure Fiducial, I have two classes reacTIVisonEx and AcqNI synchronized by the event FiduMovingEventArgs. The implementation of such interprocess communication is well documented in the .NET Framework SDK Documentation.
Finally, a quick note: I crudely measured the latency between the firing of an event and the start of acquisition, and it seems to be one millisecond (616 – 615), which seems reasonable. So I will proceed and collect data tomorrow.

Doppler experiment setup

Just finished setting up the groundwork for collecting Doppler data. Below is the Lego Mindstorms vehicle which I’ll be using to this end.

Also shown are the transmitter held in the front claw and the fiducial tag for vision-based tracking, used to generate the ground truth.

Plotting CIR in MATLAB

I’m now using MATLAB’s stem to plot my channel impulse response. In order to get rid of all the zeros, I’m using a NaN array instead with only channel coefficients copied over. This produces a nice graph with no crammed points.

Point-to-Point ethernet cosim on ML506

Now works nicely. I need to redesign the buffering scheme so that I only send one packet but possibly receive more than one, depending on the size allowed. Can’t remember how many bytes per packet can be accommodated exactly, but certainly not much.

Note on heterodyning and decimation

A quick remark that having the heterodyning and decimation stages in succession means that we can absorb one LPF into the other nicely. I’ve retracted the stop band a bit to give some slack when halving the sampling rate which I hope is not affecting the direct sequence spread spectrum a lot.

On Matlab DSP real-time mockery – again

I think I know for sure now that the built-in multi-rate functions are not terribly useful for mimicking real-time behavior since the delay line doesn’t save states from call to call. Therefore, I have to reproduce the same functionality from smaller building blocks to work around this. For instance, to decimate I use the DSP toolbox function filter followed by downsample on an invocation-by-invocation basis. I only need to set the filter property ‘PersistentMemory’ to true.
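The same building-block idea, a filter that keeps its delay line between calls followed by a downsampler that keeps its phase, can be sketched outside MATLAB as follows. This is an illustrative model with a direct-form FIR and a made-up class name, not a port of the toolbox functions.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Block-by-block FIR decimator. The delay line and the decimation phase are
// members, so they persist from one call to the next.
class StreamingDecimator {
public:
    StreamingDecimator(std::vector<double> taps, std::size_t factor)
        : taps_(std::move(taps)),
          delay_(taps_.size(), 0.0),
          factor_(factor),
          phase_(0) {
        assert(!taps_.empty() && factor_ > 0);
    }

    // Filter one block and keep every factor-th output sample.
    std::vector<double> process(const std::vector<double>& block) {
        std::vector<double> out;
        for (double x : block) {
            // Shift the delay line and insert the newest sample at the front.
            for (std::size_t k = delay_.size() - 1; k > 0; --k) {
                delay_[k] = delay_[k - 1];
            }
            delay_[0] = x;
            if (phase_ == 0) {
                // Only compute the outputs that survive decimation.
                double acc = 0.0;
                for (std::size_t k = 0; k < taps_.size(); ++k) {
                    acc += taps_[k] * delay_[k];
                }
                out.push_back(acc);
            }
            phase_ = (phase_ + 1) % factor_;
        }
        return out;
    }

private:
    std::vector<double> taps_;  // FIR coefficients
    std::vector<double> delay_; // persistent filter state
    std::size_t factor_;        // decimation factor
    std::size_t phase_;         // position within the current decimation cycle
};
```

Feeding the samples 1 to 6 in one call or as two blocks of three gives the same three outputs, which is the property the stateless built-ins lack.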

Carrier distortion due to chip-rate synthesis turned out not to have a big impact on the overall operation. But it is still interesting from a numerical point of view that it is not perfect.

On MUTEX co-sim

I opted to implement mutual exclusion as a simple 4-phase asynchronous protocol to avoid more elaborate OS-style techniques. Those often require a shared variable to resolve priority, which should be writeable by the two contending processes.

Hardware co-sim workaround

OK, plan B. Here is the deal:

  • Not enough onboard memory (in FPGA) to buffer everything in one go
  • Some sort of mutex and synchronization is therefore needed
  • LockableSharedMemory class is not available from Matlab, i.e. no mex interface
  • I can probably write a mex wrapper
  • This is error-prone because it’s uncharted territory for me
  • Zero help from Xilinx forums

Solution:

  • go back and implement your own mutex in hardware
  • use two registers owned by control statemachine and PC respectively
  • poll on these registers at both sides
  • then use Shmem as provided by Xilinx in Matlab
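The two-register polling scheme in the solution list can be modelled as a 4-phase request/acknowledge handshake. The sketch below is a behavioural illustration only: the register names `req` and `ack`, the `buffer_ready` flag and the ownership rule are assumptions made for the example.

```cpp
#include <cassert>

// Each register has exactly one writer, so no shared read-modify-write
// variable is needed.
struct HandshakeRegs {
    bool req = false; // written only by the PC side
    bool ack = false; // written only by the hardware state machine
};

enum class Owner { Hardware, Pc, Nobody };

// Who may touch the shared buffer for a given register state.
Owner buffer_owner(const HandshakeRegs& r) {
    if (r.req && r.ack) return Owner::Pc;
    if (!r.req && !r.ack) return Owner::Hardware;
    return Owner::Nobody; // handshake in flight, nobody touches the buffer
}

// PC side, phase 1: ask for the buffer once the previous cycle has finished.
void pc_request(HandshakeRegs& r) {
    if (!r.req && !r.ack) r.req = true;
}

// PC side, phase 3: hand the buffer back after reading it.
void pc_release(HandshakeRegs& r) {
    if (r.req && r.ack) r.req = false;
}

// Hardware side, polled every cycle. `buffer_ready` (hypothetical) tells the
// state machine that it has finished writing the buffer.
void hardware_poll(HandshakeRegs& r, bool buffer_ready) {
    if (r.req && !r.ack && buffer_ready) {
        r.ack = true;  // phase 2: grant the buffer to the PC
    } else if (!r.req && r.ack) {
        r.ack = false; // phase 4: take the buffer back
    }
}
```

Because each side only writes its own register and polls the other, the two in-flight states act as guard intervals during which neither side accesses the memory.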

Co-simulation programming cable

In order to make the JTAG cable visible for co-simulation, I have found that IMPACT has to be invoked first and that the JTAG chain ought to be initialized.

Real-time spectral density measurement

I think I figured out a nice way of measuring the spectral density online. Basically, I run the FFT on my mixed-signal Agilent with the required spectral resolution and then switch on averaging, with a rather large averaging count, for a loud and clear online spectral analysis. Worked pretty well for my DSSS signal.

Truncation threshold in adaptive acquisition

It is of utmost importance to set up the truncation threshold correctly in the adaptive algorithm. It should be done in accordance with the signal SNR. So far, I can think of two solutions:

  • Do it heuristically and implement an automatic gain control (AGC) loop.
  • Figure out a way to calculate the SNR on the fly and work out the threshold based on this.

Few notes on Matlab DSP real-time mockery

After some trials to do with trying to put together a crude real-time-like DSP algorithm in Matlab, I have observed the following:

  • attempting to construct a carrier on the fly with oversampling rate results in imperfect frequency synthesis. 
tc_end = -ts;
.
.
.
for i = 1:N_mix
   tc(i) = tc_end + ts;
   tc_end = tc(i);
end
carrier_chip(:,1) = cos(2*pi*fc*tc) + j*sin(2*pi*fc*tc);
  • in order to force filter to remember states from call to call use (either syntax is fine):
set(Hd_lp, 'PersistentMemory', true);
Hd_lp.states = 0;
  • it is hard to use the supplied resample function to mock a real-time operation. The function applies an anti-aliasing filter and, although I can force its order, I’m not sure how to make it save states. It’s always better to be on top of things and know exactly what’s happening instead of using “black boxes.”
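One plausible contributor to the imperfect synthesis in the first note (an assumption, not something verified here) is rounding error building up in the running time sum `tc_end + ts`. The sketch below measures that drift against a time base derived from an integer sample index, and shows a carrier phase computed from the index directly. Function names and the sample rates in the usage note are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Largest gap between a running-sum time base (t += ts) and a direct one
// (t = i * ts) over n samples.
double max_time_drift(double ts, std::size_t n) {
    double t_acc = -ts;
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        t_acc += ts;                                   // accumulates rounding
        double t_direct = static_cast<double>(i) * ts; // one rounding per sample
        double gap = std::fabs(t_acc - t_direct);
        if (gap > worst) worst = gap;
    }
    return worst;
}

// Carrier phase in [0, 2*pi) from an integer sample counter that is kept
// across calls; fc is the carrier frequency and fs the sampling rate.
double carrier_phase(unsigned long long sample_index, double fc, double fs) {
    const double two_pi = 6.283185307179586;
    double cycles = static_cast<double>(sample_index) * fc / fs;
    return two_pi * (cycles - std::floor(cycles)); // drop whole cycles
}
```

For a sampling period that is exactly representable in binary, such as 0.25, the drift is zero. For something like 1/48000 it is tiny but generally non-zero, and it grows with the length of the run.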

Driving Dolphin Tx

When driving Dolphin calf v1.1, make sure both power and excitation signal have the same ground. If you don’t do so, the calf will produce a faint buzzing sound and no ultrasound. In other words, the power supply of my Tx daughterboard and the power supply driving the Dolphin calf should be jointly grounded.

Stateflow local variables

When defining an array or matrix as a local variable in Stateflow, bear in mind that memory addressing is C-like as opposed to Matlab-like i.e. indices always start at zero not one. You should be able to infer this from the syntax style as well.

Specifying data types in SysGen

Always make sure to specify data types for System Generator blocks. Just spotted that the read port of a BRAM was connected to a zero whose format hadn’t been set properly, which was propagating through and ruining everything.

Bitwise logic also behaves like this. I had to use a Reinterpret block for the zero injector after manually bitbashing my signal in order to align the enable bus with the incoming data from the accumulation pipeline.

Feedback loop in Simulink/SysGen

Always remember to break the feedback path with a delay for designs in Simulink, SysGen, or both combined. Otherwise you run into trouble that has something to do with non-deterministic initialization or something of this nature. Moreover, in combined designs, hardware clock always halves Simulink’s period i.e. z^-4 in Simulink gives effective z^-2 in SysGen.

Note on enables in SysGen

The only thing that takes precedence over enable is apparently reset. Everything else is wrapped with the enable if statement including load for instance. This actually caused me some minor headache before I managed eventually to spot it. Come to think of it, it has to be because these enables in the first place are meant to be activated by enable waveforms which is typical of FPGA designs.

Simulink can act weird

Being a graphical programming environment, Simulink may not properly initialize blocks copied from one model to another, which can cause a MATLAB crash. It is always best to start with new models in general.

Wishbone IF statemachine in Simulink

I need to model the Wishbone bus transaction in order to proceed to simulate and verify my core. Using a Stateflow Chart, the bus statemachine is realized reading data from MATLAB’s workspace to feed the core.

A few notes on syntax and tool configuration:

  1. specify inputs and outputs
  2. design states and state transitions for the bus
  3. add local variable(s) e.g. memory pointer for the external buffer
  4. “Add Data” as parameters to be imported from workspace
  5. use prefix ml with the . dot operator to access native MATLAB variables or functions
  6. complex data is not supported so make sure you set up workspace accordingly.
  7. You can always edit things using right-click->explore

Note also that in the following line: dat_re_o = ml.outsig_resamp_re[mem_addr]; square brackets have to be used as opposed to round, Matlab-style parentheses.

Simulink addressable memory model

OK. Statemachine in Simulink works now. I need to find a way to incorporate an addressable memory model with the IF Wishbone signals.

I posted asking about this in the Xilinx forums but I have few leads on how to do this. While scrolling through the Stateflow(R) PDF user guide, I saw that it’s possible to add some m-functions to the Stateflow object. Thus I can load the required variable programmatically to the workspace and easily implement a memory model addressed by a local pointer (variable) inside the statemachine and increment it upon the reception of a valid ack pulse from the core.
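The memory model described above, a local pointer into a preloaded buffer that advances on each valid ack, can be sketched in a few lines. This is a behavioural illustration and not the Stateflow chart itself: the class name is made up, and only a strobe, a data output and the core’s ack are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Behavioural model of the bus-side feeder: it presents memory[mem_addr] to
// the core and advances the pointer whenever the core acknowledges.
class WishboneFeeder {
public:
    explicit WishboneFeeder(std::vector<int16_t> memory)
        : memory_(std::move(memory)), mem_addr_(0) {}

    // Strobe stays high while there is still data to transfer.
    bool stb() const { return mem_addr_ < memory_.size(); }

    // Data currently driven towards the core (zero once the buffer is done).
    int16_t dat_o() const { return stb() ? memory_[mem_addr_] : 0; }

    // Call once per clock edge with the core's ack signal.
    void clock(bool ack_i) {
        if (stb() && ack_i) ++mem_addr_; // valid ack: move to the next word
    }

    std::size_t mem_addr() const { return mem_addr_; }

private:
    std::vector<int16_t> memory_; // buffer loaded from the workspace
    std::size_t mem_addr_;        // local memory pointer
};
```

Holding the same word until the ack arrives is what lets the core insert wait states without losing data.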

Generating handshake signals in Simulink to talk with SysGen

After a day of looking into this, I figured the best thing to do in order to simulate my WISHBONE-compliant core in Simulink is to generate timing-critical signals in a Stateflow Chart object from the Stateflow(R) library. This would leverage the feature-rich environment of Matlab and Simulink and as such is better than carrying the design to an HDL simulator.

I arrived at this after remembering that I’d come across something similar in the past. It was an early days SysGen design to interface with an OPB bus over CoreConnect(TM). A processor model was constructed by the Xilinx folks to feed the IP.

I will be testing the idea tomorrow, until then: Astral Doors RULES!

Gold code generation unit revised

I’ve fixed the Gold code generation unit to have all codes in sync when propagating through the delay line. This is done by adding a small statemachine to the unit.

Always try to decentralize control since it becomes more manageable.

Note on selecting Chipscope Pro signals in probing banks

When selecting signals to be probed by Chipscope Pro in a given bank, inputs should be read after their respective buffers, not before them. Probing them before the buffers generates an error in the PAR or implementation phases.

Chipscope Pro 10.1.03 bug

After the usual wrestling with the tools, it turned out that using Chipscope 10.1.03 with the ATC2 core results in pins that are locked to incorrect locations. A patch is provided by Xilinx to address this as detailed in answer record:

AR #31746 – 10.1 ChipScope Pro – When I generate my ATC2 Agilent using the Inserter flow, the pins are locked to incorrect locations

Issue is now resolved and I managed to debug my VHDL code now.

Local definitions in TikZ

Apparently, when defining a local macro in TikZ, the macro name can’t contain digits at all (multi-character TeX control-sequence names may only contain letters). I use Roman numerals instead if I’m keen on indexing letters or words. Example:

\def\Ti{\textcolor{coors}{$T_1$}}
\def\Tii{\textcolor{coors}{$T_2$}}
\def\Tiii{\textcolor{coors}{$T_3$}}

Anthorn official logo

I’m pleased to announce that my system has now a brand new logo celebrating the official departure from mainstream Relate!

Ultracontroller on ML403 resolved

After a rather agonizing week of blind speculative troubleshooting, I managed to understand exactly what took place during tests. Information presented here is a compilation of bits and pieces gathered from various usenet posts – primarily comp.arch.fpga and Xilinx forums – with emphasis on reliable people.

  • bottom line: Xilinx 9.2i tools are ‘buggy’, avoid using at all cost! 9.1 is vouched for by experts to be much more stable. My dilemma, however, is to be able to use ISE in conjunction with BOTH EDK and SysGen. 9.1 doesn’t support SysGen. For the time being, I have both installed and I’ll be alternating back and forth between them with the ever daunting task of changing Windows Xilinx’ Environment Variables each hop. Look into scripting this presuming it can be done on Windows. Can this be done in cygwin?
  • now, on the ‘buggy’ 9.2 tools issue, I observed the following:
    • ISE9.2 doesn’t spot certain errors when compared to 9.1 e.g. in ucf the original source code had the async interrupt constraint applied on the VHDL signal instead of the corresponding I/O pin. Not sure if this didn’t use to be allowed and now is.
      # Ignore this gpio bit as it is an async interrupt in this example
      NET "gpio_out_s<4>" TIG;
    • there is definitely something wrong with the IMPACT executable in the 9.2 version. I tried programming the PROM with a pregenerated .mcs file in the reference design but it didn’t work. This was observed under ISE9.2.04 & EDK9.2.02 (both latest, 0x denoting sp x); however, it did work with the ISE9.2.03 & EDK9.2 combination, I mean just for the precompiled .mcs. I may later submit a webcase to Xilinx about this, if I have the patience and probably with Mike’s aid on this.

That’s it on the UC2 issue. Will be posting soon about testing Anthorn.

A method for debugging FPGA designs

After reading up a bit on the subject, I think I have now figured out quite a neat way of accomplishing this.

The problem arose when I wanted to debug the Anthorn analog daughterboard without having to fully define my system architecture at this point. Antti (comp.arch.fpga) pointed out that he usually verifies that an SPI chip is responding by first testing it in software (SPI software). Now, given the horrendous delay of accesses over PLB (for instance with all the arbitration overhead), my software SPI would be too slow. TI’s SAR ADC can’t handle slow SPI because the successive approximation architecture is based around a capacitive network (silicon efficiency) with inherent S/H.

What I think now is a viable solution to even future debugging is an UltracontrollerII design with PPC running fast enough to emulate the functionality under test. Be sure to configure .ucf accordingly for the ml403 board with all the gpio’s you need defined in there.

Anthorn Analog Front-end Interface

Wrap a generic SPI core inside a customized ADC protocol to facilitate modularity.

Open issues I’m still considering are:

  • how much memory I should assign to the front-end buffer
  • do I want to burden the software with the task of moving these chunks of data (DMA?)
  • should I adopt a datapath approach

Bottom line now is that after I’m done with these rudimentary tasks, I should look into concretely and systematically designing the foundation of my architecture. Still I would expect major revisions every now and again as the design matures.

Matt of PCB

This is something I once wrote in praise of EIS’s electronics RA Matt Oppenheim. Thought it’ll spice up things around here a bit.

Onward into the heart of doom marched Matt
Overwhelmed many times, still he carried on
Sweat poured forth from his forehead spreading onto his tools
Jeerers waited for the trembling fingers that once made magic
But Zargos alone would choose the day he would be condemned to oblivion
And in his hour of need
He sent forth unto him the PCB foresight!
Now god and man he rose up from his desk
Screaming like a wild animal
Such is the gift of absolute electronic wisdom!
No EM noise could harm him; he laid out R’s and C’s alike
And every IC trod on his board
Were soldered that day!
Hail Matt of PCB!

Intro

Here you’ll see the nitty-gritty of the daily research I conduct.

Accessing ACC from C30

For the dsPIC DSC, in order to access certain efficient assembly instructions, some special built-in C30 functions and syntax have to be used. For detailed explanation see the C30 manual. Here I list a code snippet to illustrate the concept:

float matchedEnv;
register int acca asm("A");
fractional fractBuff;
.
.
.
acca = __builtin_clr();
acca = __builtin_mac(matchedOut[TAP_OFF_INDEX+k], 
             matchedOut[TAP_OFF_INDEX+k],
             NULL, NULL, NULL, NULL, NULL, NULL, NULL);
.
.
fractBuff = __builtin_sac(acca, 0);
matchedEnv = Fract2Float(fractBuff);

Will be posting more on this later.

Up’n’running

In an attempt to become a tech trendy, I’m giving this a go. Usually, I don’t feel comfortable expressing myself in public. So I’ll try to keep this nerdy, research-y, and dry. Be warned! Still there is a glimmer of hope here that you might actually stumble on something cool every now and again.



Copyright © 2008–2010. All rights reserved.
