
Alibaba cloud FPGA: the 200$ Kintex UltraScale+


Introduction
#

I was recently in the market for a new FPGA to start building my upcoming projects on.

Due to the scale of my upcoming projects, a Xilinx UltraScale+ FPGA of the Virtex family would be perfect, but a Kintex series FPGA will be sufficient for early prototyping. Since I did not want to part with the eye-watering amount of money required for a Vivado Enterprise edition license, my choice was effectively narrowed to the FPGA chips available under the WebPack version of Vivado.

Xilinx supported boards per Vivado edition

Unsurprisingly, Xilinx is well aware of how top of the range the Virtex series is, and doesn’t offer any Virtex UltraScale+ chips with the WebPack license. That said, they do offer support for two very respectable Kintex UltraScale+ FPGA models, the XCKU3P and the XCKU5P.

Xilinx product guide, overview for the Kintex UltraScale+ series

These two chips are far from being small hobbyist toys: the smaller XCKU3P already boasts more than 162K LUTs and 16 GTY transceivers capable of operating at up to 32.75Gb/s, depending on the physical constraints imposed by the chip packaging.

Now that the chip selection has been narrowed down I set out to look for a dev board.

My requirements for the board were that it featured :

  • at least 2 SFP+ or 1 QSFP connector
  • a JTAG interface
  • a PCIe interface at least x8 wide

As to where to get the board from, my options were :

  1. Design the board myself
  2. Get the AXKU5 or AXKU3 from Alinx
  3. See what I could unearth on the second hand market

Although option 1 could have been very interesting, designing a dev board with both a high speed PCIe and Ethernet interface was not the goal of today’s project.

As for option 2, Alinx is a newer vendor that is still building up its credibility in the West. Their technical documentation is a bit sparse, but the feedback seems to be positive with no major issues being reported. Most importantly, Alinx provides very fairly priced development boards in the 900 to 1050 dollar range ( +150$ for the HPC FMC SFP+ extension board ). Although these are not cheap by any metric, they are the best value compared to the competition’s price points.

Option 2 was coming out ahead until I stumbled upon this eBay listing :

eBay listing for a decommissioned Alibaba Cloud accelerator FPGA

For 200$ this board featured a XCKU3P-FFVB676, two SFP+ connectors and a x8 PCIe interface. On the flip side, it came with no documentation whatsoever, no guarantee that it worked, and only the faint promise in the listing that there was a JTAG interface. A sane person would likely have dismissed this as an interesting internet oddity, a remnant of what happens when a generation of accelerator cards gets phased out in favor of the next, or maybe just an expensive paperweight.

But I like a challenge, and the appeal of unlocking the 200$ Kintex UltraScale+ development board was too great to ignore.

As such, I aim for this article to become the documentation paving the way through this mirage.

The debugger challenge
#

Xilinx’s UG908 Programming and Debugging User Guide (Appendix D) specifies their blessed JTAG probe ecosystem for FPGA configuration and debug. Rather than dropping $100+ on yet another proprietary dongle that’ll collect dust after the project ends, I’m exploring alternatives. The obvious tradeoff: abandoning Xilinx’s toolchain means losing ILA integration. However, the ILA fundamentally just captures samples and streams them via JTAG USER registers; nothing prevents us from building our own logic analyzer with equivalent functionality and a custom host interface.

Enter OpenOCD. While primarily targeting ARM/RISC-V SoCs, it maintains an impressive database of supported probe hardware and provides granular control over JTAG operations. More importantly, it natively supports SVF (Serial Vector Format), a vendor-neutral format describing JTAG operations, which Vivado can export to carry a bitstream.

The documentation landscape is admittedly sparse for anything beyond 7-series FPGAs, and the most recent OpenOCD documentation I could unearth was focused on Zynq ARM core debugging rather than fabric configuration. But the fundamentals remain sound: JTAG is JTAG, SVF is standardized, and the boundary scan architecture hasn’t fundamentally changed.

The approach should be straightforward: generate SVF from Vivado, feed it through OpenOCD with a commodity JTAG adapter, and validate the configuration. Worst case, we’ll need to patch some adapter-specific quirks or boundary scan chain register addresses. Time to find out if this theory holds up in practice.

The plan
#

So, to recap, the current plan is to buy a second hand hardware accelerator off eBay at a too good to be true price, and try to configure it with an unofficial probe using open source software without any clear official support.
The answer to the obvious question of what could possibly go wrong, if you, like me, have been around the block a few times, is: many things.

As such, we need a plan for approaching this. The goal of this plan is to outline incremental steps that will build upon themselves with the end goal of being able to use this as a dev board.

1 - Confirming the board works
#

First order of business will be to confirm the board is showing signs of working as intended.

There is a high probability that the flash wasn’t wiped before this board was sold off; as such, the previous bitstream should still be in the flash. Given this board was used as an accelerator, we should be able to use that to confirm the board is working by checking either that the board presents itself as a PCIe endpoint or that the SFPs are sending the Ethernet PHY idle sequence.

2 - Connecting a debugger to it
#

The next step is going to be to try and connect the debugger. The eBay listing advertised there is a JTAG interface, but the picture is grainy enough that where that JTAG is and what pins are available is unclear.

Additionally, we have no indication of what devices are daisy chained together onto the JTAG scan chain. This is an essential question for flashing over JTAG, so it will need to be figured out.

At this point, it would also be strategic to try and do some more probing into the FPGA via JTAG. Xilinx FPGAs expose a handful of useful system registers accessible over JTAG. The most well known of these interfaces is the SYSMON, which allows us, among other things, to get real time temperature and voltage readings from inside the chip. Although OpenOCD doesn’t have SYSMON support out of the box, it would be worthwhile to build it, to :

  1. Familiarise myself with OpenOCD scripting, which might come in handy when building my ILA replacement down the line
  2. Have an easy side channel to monitor FPGA operating parameters
  3. Make a contribution to OpenOCD, as it has support for interfacing with XADC but not SYSMON

3 - Figuring out the Pinout
#

The hardest part will be figuring out the FPGA’s pinout and my clock sources. The questions that need answering are :

  • what external clock sources do I have, what are their frequencies and which pins are they connected to
  • which transceivers are the SFPs connected to
  • which transceivers is the PCIe connected to

4 - Writing a bitstream
#

For now I will be focusing on writing temporary configurations over JTAG to the CCLs (CMOS Configuration Latches) and not on rewriting the flash.

The plan is to try writing the bitstream either directly through OpenOCD’s virtex2 + pld drivers, or by replaying the SVF generated by Vivado.

Since I believe a low iteration time is paramount to project velocity and getting big things done, I also want to automate the entire Vivado flow, from the RTL to the SVF generation.

Simple enough ?

Liveness test
#

A few days later my prize arrived via express mail.

My prized Kintex UltraScale+ FPGA board also known as the decommissioned Alibaba cloud accelerator. Jammed transceiver now safely removed.

Unexpectedly it even came with a free 25G SFP28 Huawei transceiver rated for a 300m distance and a single 1m long OS2 fiber patch cable. This was likely not intentional as the transceiver was jammed in the SFP cage, but it was still very generous of them to include the fiber patch cable.

Free additional SFP28-25G-1310nm-300m-SM Huawei transceiver, and 1m long OS2 patch cable

The board also came with a travel case, half of a PCIe to USB adapter, and a 12V power supply that one could use to power the board as a standalone device. Although this standalone configuration will not be of any use to me, it could come in handy for those looking to develop just networking interfaces without any PCIe interface.

Overall the board looked a little worn, but both the transceiver cages and PCIe connectors didn’t look to be damaged.

Standalone configuration
#

Before real testing could start, I first did a small power-up test using the PCIe to USB adapter that the seller provided. A quick check of the LEDs and of the heat dissipated by the FPGA showed that the board seemed to be powering up, at least at a surface level (pun intended).

PCIe interface
#

As a reminder, this next section relies on the flash not having been wiped and still containing the previous user’s design.

Since I didn’t want to directly plug mystery hardware into my prized build server, I decided to use a Raspberry Pi 5 as my sacrificial test device and got myself an external PCIe adapter.

It just so happens that the latest Raspberry Pi version, the Pi 5, features an external PCIe Gen 2.0 x1 interface. Though our FPGA can handle up to PCIe Gen 3.0 and the board has a x8 wide interface, the PCIe standard is backwards compatible and the number of lanes on a link can be downgraded, so plugging our FPGA into this Raspberry Pi will work.

FPGA board connected to the Raspberry Pi 5 via the PCIe to PCIe x1 adapter

After both the Raspberry Pi and the FPGA had booted, I SSHed into my rpi and started looking for the PCIe enumeration sequence logged by the Linux PCIe core subsystem.

dmesg log :

[    0.388790] pci 0000:00:00.0: [14e4:2712] type 01 class 0x060400
[    0.388817] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.389752] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    0.495733] brcm-pcie 1000110000.pcie: link up, 5.0 GT/s PCIe x1 (!SSC)
[    0.495759] pci 0000:01:00.0: [dabc:1017] type 00 class 0x020000

Background information
#

Since not everyone is intimately familiar with PCIe terminology, allow me to quickly document what is going on here.

0000:00:00.0: is the identifier of a specific PCIe device connected through the PCIe network to the kernel, it reads as domain:bus:device.function.

[14e4:2712]: is the device’s [vendor id:device id]. Vendor ids are assigned by the PCI standard body (PCI-SIG) to hardware vendors. Vendors are then free to define their own device ids.

The full list of official vendor ids and released device ids can be found : https://admin.pci-ids.ucw.cz/read/PC/14e4 or in the linux kernel code : https://github.com/torvalds/linux/blob/7aac71907bdea16e2754a782b9d9155449a9d49d/include/linux/pci_ids.h#L160-L3256

type 01: PCIe has two types of devices: bridges, which allow the connection of multiple downstream devices to an upstream device, and endpoints, which are the leaves. Bridges are of type 01 and endpoints of type 00.

class 0x060400: is the PCIe device class, it categorizes the kind of function the device performs. It uses the following format 0x[Base Class (8 bits)][Sub Class (8 bits)][Programming Interface (8 bits)], ( note : the sub class field might be unused ).

A list of class and sub class identifiers can be found: https://admin.pci-ids.ucw.cz/read/PD or again in the linux codebase : https://github.com/torvalds/linux/blob/7aac71907bdea16e2754a782b9d9155449a9d49d/include/linux/pci_ids.h#L15-L158
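
To make the field layout above concrete, here is a minimal Python sketch that splits one of these kernel log lines into its components. The regular expression and the PciFunction / parse_pci_line names are my own illustration and not part of any existing tool; the only assumption is that the line follows the format seen in the dmesg output above.

```python
import re
from dataclasses import dataclass

# Matches kernel lines shaped like the ones in the dmesg log, e.g.
#   pci 0000:01:00.0: [dabc:1017] type 00 class 0x020000
PCI_LINE = re.compile(
    r"pci (?P<domain>[0-9a-f]{4}):(?P<bus>[0-9a-f]{2}):"
    r"(?P<device>[0-9a-f]{2})\.(?P<function>[0-7]): "
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<dev_id>[0-9a-f]{4})\] "
    r"type (?P<type>[0-9a-f]{2}) class 0x(?P<cls>[0-9a-f]{6})"
)


@dataclass(frozen=True)
class PciFunction:
    """Hypothetical container for the fields of one enumeration line."""
    domain: int
    bus: int
    device: int
    function: int
    vendor_id: int
    device_id: int
    header_type: int
    base_class: int
    sub_class: int
    prog_if: int

    @property
    def is_bridge(self) -> bool:
        # Bridges are reported as type 01, endpoints as type 00.
        return self.header_type == 0x01


def parse_pci_line(line: str) -> PciFunction:
    """Parse a single 'pci domain:bus:device.function: [...]' log line."""
    m = PCI_LINE.search(line)
    if m is None:
        raise ValueError(f"not a PCI enumeration line: {line!r}")
    cls = int(m["cls"], 16)
    return PciFunction(
        domain=int(m["domain"], 16),
        bus=int(m["bus"], 16),
        device=int(m["device"], 16),
        function=int(m["function"], 16),
        vendor_id=int(m["vendor"], 16),
        device_id=int(m["dev_id"], 16),
        header_type=int(m["type"], 16),
        base_class=(cls >> 16) & 0xFF,  # top 8 bits
        sub_class=(cls >> 8) & 0xFF,    # middle 8 bits
        prog_if=cls & 0xFF,             # bottom 8 bits
    )


if __name__ == "__main__":
    line = "[    0.495759] pci 0000:01:00.0: [dabc:1017] type 00 class 0x020000"
    print(parse_pci_line(line))
```

The class code is treated as a plain 24 bit integer and sliced with shifts and masks, which mirrors the 0x[Base Class][Sub Class][Programming Interface] layout described above.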

Dmesg log
#

The two most interesting lines of the dmesg log are :

[    0.388790] pci 0000:00:00.0: [14e4:2712] type 01 class 0x060400
[    0.495759] pci 0000:01:00.0: [dabc:1017] type 00 class 0x020000

Firstly, the PCIe subsystem logs that at 0000:00:00.0 it has discovered a Broadcom BCM2712 PCIe Bridge ( vendor id 14e4, device id 2712 ). This is a bridge (type 01), and class 0x0604xx tells us it is a PCI-to-PCI bridge, meaning it essentially provides a downstream bus to which endpoint devices or additional bridges can attach.

The subsystem then discovers a second device at 0000:01:00.0. This is an endpoint (type 00), and class 0x020000 tells us it is an Ethernet network controller.

Of note, dabc doesn’t correspond to a known vendor id. When designing a PCIe interface in hardware, these are parameters we can configure. Additionally, among the different ways Linux identifies which driver to load for a PCIe device, the vendor id and device id can be used for matching. Supposing we are implementing custom logic, in order to prevent any bug where the wrong driver might be loaded, it is best to use a separate vendor id. This also helps identify your custom accelerator at a glance and use it to load your custom driver.
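
As a rough illustration of why a distinct vendor id avoids loading the wrong driver, here is a simplified Python model of id table matching. It is only a sketch: the real Linux matching also considers subsystem ids and the class code, and the driver names below are made up for the example. The wildcard value mirrors the kernel's PCI_ANY_ID.

```python
from typing import NamedTuple, Optional, Sequence

# Wildcard meaning "match any value", mirroring the kernel's PCI_ANY_ID (~0).
PCI_ANY_ID = 0xFFFFFFFF


class IdEntry(NamedTuple):
    vendor: int
    device: int
    driver: str  # hypothetical driver name, for illustration only


def find_driver(table: Sequence[IdEntry], vendor: int, device: int) -> Optional[str]:
    """Return the first driver whose (vendor, device) entry matches, else None."""
    for entry in table:
        vendor_ok = entry.vendor in (PCI_ANY_ID, vendor)
        device_ok = entry.device in (PCI_ANY_ID, device)
        if vendor_ok and device_ok:
            return entry.driver
    return None


# Made-up table: one entry bound to an exact id pair, one claiming every
# device id under the custom vendor id seen on the FPGA board.
ID_TABLE = [
    IdEntry(0x14E4, 0x2712, "example_bridge_driver"),
    IdEntry(0xDABC, PCI_ANY_ID, "my_custom_accel"),
]

if __name__ == "__main__":
    print(find_driver(ID_TABLE, 0xDABC, 0x1017))
    print(find_driver(ID_TABLE, 0x1234, 0x5678))
```

Since no stock driver lists vendor id dabc in its table, nothing binds to the board by id until a custom driver claims it.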

As such, it is not surprising to see an unknown vendor id appear for an FPGA. Combined with the Ethernet network controller class, this is a strong hint that this is our board.

Full PCIe device status
#

The dmesg logs have already given us a good indication that our FPGA board and its PCIe interface are working, but to confirm with certainty that the device with vendor id dabc is our FPGA we now turn to lspci. lspci -vvv gives the most verbose output, with a full overview of the detected PCIe devices’ capabilities and current configurations.

Broadcom bridge:

0000:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2712 PCIe Bridge (rev 21) (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 38
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        Memory behind bridge: [disabled] [32-bit]
        Prefetchable memory behind bridge: 1800000000-182fffffff [size=768M] [32-bit]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x1
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt+
                RootCap: CRSVisible+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
                         AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS+
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported, DRS-
                         DownstreamComp: Link Up - Present
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
                RootCmd: CERptEn+ NFERptEn+ FERptEn+
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
        Capabilities: [160 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [240 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=8us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=1us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [300 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: pcieport

FPGA board:

0000:01:00.0 Ethernet controller: Device dabc:1017
        Subsystem: Red Hat, Inc. Device a001
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Region 0: Memory at 1820000000 (64-bit, prefetchable) [disabled] [size=2K]
        Region 2: Memory at 1800000000 (64-bit, prefetchable) [disabled] [size=512M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x1 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [1c0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0

For our board, the following lines are particularly interesting:

                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x1 (downgraded)

The LnkCap tells us about the full capabilities of this PCIe device, here we can see that the current design supports PCIe Gen 3.0 x8. The LnkSta tells us the current configuration, here we have been downgraded to PCIe Gen 2.0 at 5GT/s with a width of only x1.

During startup, or when a new PCIe device is plugged in, PCIe performs a link speed and width negotiation where it tries to reach the highest supported stable configuration for the current system. In our current system, though our FPGA is capable of 8GT/s, it is located downstream of the Broadcom bridge, which has a maximum link capacity of Gen 2.0 ( 5GT/s ), so the FPGA has been downgraded to 5GT/s.

As for the width of x1, that is expected since the Broadcom bridge is also only x1 wide, and our board’s other 7 PCIe lanes are literally hanging over the side.
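
The outcome of this negotiation can be approximated with a few lines of Python. This is a deliberately simplified model, assuming both ends support every lower speed and narrower width, and it ignores the actual link training state machine. The LinkCap, negotiate and payload_gbps names are mine; the line coding figures (8b/10b for Gen 1 and Gen 2, 128b/130b for Gen 3) come from the PCIe standard.

```python
from dataclasses import dataclass

# Per-lane raw rate in GT/s and line-coding efficiency for each generation.
GEN_RATE = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


@dataclass(frozen=True)
class LinkCap:
    """Maximum generation and lane count advertised by one end of the link."""
    max_gen: int
    max_width: int


def negotiate(a: LinkCap, b: LinkCap) -> LinkCap:
    """Simplified model: the link settles on the best setting both ends support."""
    return LinkCap(min(a.max_gen, b.max_gen), min(a.max_width, b.max_width))


def payload_gbps(link: LinkCap) -> float:
    """Usable bit rate in Gb/s after line coding, before protocol overhead."""
    rate, efficiency = GEN_RATE[link.max_gen]
    return rate * efficiency * link.max_width


if __name__ == "__main__":
    bridge = LinkCap(max_gen=2, max_width=1)  # LnkCap of the Broadcom bridge
    fpga = LinkCap(max_gen=3, max_width=8)    # LnkCap of the FPGA design
    link = negotiate(bridge, fpga)
    print(link, payload_gbps(link), payload_gbps(fpga))
```

With the LnkCap values from the lspci output, the model lands on Gen 2.0 x1, which is what LnkSta reports, and shows how much of the board's Gen 3.0 x8 capacity is left unused on the Raspberry Pi.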

7 PCIe lanes left unconnected and hanging over the air

Thus, we can finally confirm that this is our board and that the PCIe interface is working. We can now proceed to establishing the JTAG connection.

JTAG interface
#

Xilinx FPGAs can be configured by writing a bitstream to their internal CMOS Configuration Latches (CCL). CCL is SRAM memory and volatile, thus the configuration is re-done on every power cycle. For devices in the field this bitstream would be read from an external SPI memory during initialization, or written from an external device, such as an embedded controller. But for development purposes overwriting the contents of the CCLs over JTAG is acceptable.

This configuration is done by shifting in the entire FPGA bitstream into the device’s configuration logic over the JTAG bus.

FPGA board JTAG interface
#

As promised by the original eBay listing the board did come with an accessible JTAG interface, and gloriously enough, this time there wasn’t even the need for any additional soldering.

View of the JTAG interface on the PCB

In addition to a power reference and ground, in conformance with the Xilinx JTAG interface, it featured the four mandatory signals comprising the JTAG TAP :

  • TCK Test Clock
  • TMS Test Mode Select
  • TDI Test Data Input
  • TDO Test Data Output

Of note, the JTAG interface can also come with an independent reset signal. But since Xilinx JTAG interfaces do not have this independent reset signal, we will be using the JTAG FSM reset state as our reset.
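
Resetting through the FSM works because of a property of the IEEE 1149.1 TAP controller: holding TMS high for five TCK cycles brings it back to Test-Logic-Reset from any state. The Python sketch below models the 16 state transition table to check this; the state names and helper functions are my own naming, not OpenOCD's.

```python
# Next-state table of the IEEE 1149.1 TAP controller:
#   state -> (next state if TMS=0, next state if TMS=1)
TAP_FSM = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN":   ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN":   ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}


def clock_tms(state: str, tms_bits) -> str:
    """Apply one TCK rising edge per TMS bit and return the final state."""
    for tms in tms_bits:
        state = TAP_FSM[state][tms]
    return state


def reset_from(state: str) -> str:
    """Hold TMS high for five TCK cycles, starting from any state."""
    return clock_tms(state, [1] * 5)


if __name__ == "__main__":
    for start in TAP_FSM:
        print(f"{start:17s} -> {reset_from(start)}")
```

Five cycles is the worst case: starting from Shift-DR or Pause-DR, four cycles with TMS high only get as far as Select-IR-Scan.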

6 pin board JTAG interface

This interface layout doesn’t follow a standard layout so I cannot just plug in one of my debug probes, it requires some re-wiring.

Segger JLINK :heart:
#

I do not own an AMD approved JTAG programmer.

Traditionally speaking, the Segger JLink is used for debugging embedded CPUs, whether standalone or in a Zynq, and not for configuring FPGAs.

That said, all we need to do is use JTAG to shift in a bitstream to the CCLs, so technically speaking any programmable device with 4 sufficiently fast GPIOs can be used as a JTAG programmer. Additionally, the JLink is well supported by OpenOCD, the JLink’s libraries are open source, and I happened to own one.

Note : I could also have used a USB Blaster, which, considering it is literally an Altera tool, would have been hilarious.

20 pin Segger JLink pinout

Wiring
#

Rewiring :

Wiring diagram to connect JLink JTAG probe to the board.

JTAG is a synchronous protocol in which the target samples TDI and TMS on the rising edge of TCK. Because of this, good JTAG PCB trace length matching is advised in order to minimize skew between these signals.

Timing Waveform for JTAG Signals (From Target Device Perspective); source : https://www.intel.com/content/www/us/en/docs/programmable/683719/current/jtag-timing-constraints-and-waveforms.html

Ideally a custom connector with length matched traces to work as an interface between the JLink’s probe and a board specific connector would be used.

Far from length matched JTAG connections

Yet, here we are shoving breadboard wires between our debugger and the board. Since OpenOCD allows us to easily control the debugger clock speed, we can increase the skew tolerance by slowing down the TCK clock signal. As such there is no immediate need for a custom connector but we will not be able to reach the maximum JTAG speeds.

If no clock speed is specified OpenOCD sets the clock speed at 100MHz. This is too high in our case. As such, later in the article, I will be setting the JTAG clock down to 1MHz for probing and reset, while programming will be done at 10MHz.
No issues were encountered at these speeds.
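
To get a feel for what these clock speeds mean, here is a small back of the envelope sketch in Python. It computes the TCK period, which bounds how much skew and settling time the breadboard wires can get away with, and a lower bound on the time needed to shift a bitstream. The bitstream size used here is a made-up placeholder, not the real KU3P figure, and adapter or USB overhead is ignored.

```python
def tck_period_ns(freq_hz: float) -> float:
    """Duration of one TCK cycle in nanoseconds."""
    return 1e9 / freq_hz


def min_shift_time_s(n_bits: int, freq_hz: float) -> float:
    """Lower bound on shift time: one bit per TCK cycle, no protocol overhead."""
    return n_bits / freq_hz


if __name__ == "__main__":
    # Placeholder size for illustration only, not taken from any datasheet.
    HYPOTHETICAL_BITSTREAM_BITS = 100_000_000

    for freq_hz in (1e6, 10e6):
        print(
            f"{freq_hz / 1e6:>4.0f} MHz: period {tck_period_ns(freq_hz):6.1f} ns, "
            f"shift time >= {min_shift_time_s(HYPOTHETICAL_BITSTREAM_BITS, freq_hz):5.1f} s"
        )
```

The tradeoff is visible directly: a slower TCK gives a wider window per bit for sloppy wiring, but the time to shift a full bitstream grows in proportion, which is why probing can be slow while programming benefits from the faster setting.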

OpenOCD
#

OpenOCD is a free and open source on-chip debugger software that aims to be compatible with as many probes, boards and chips as possible.

Since OpenOCD has support for the standard SVF file format, my plan for my flashing flow will be to use Vivado to generate the SVF and have OpenOCD flash it. Now, some of you might be starting to notice that I am diverging quite far from the well lit path of officially supported tools. Not only am I using a not officially supported debug probe, but I am also using some obscure open source software with questionable support for interfacing with Xilinx UltraScale+ FPGAs. You might be wondering, given that the officially supported tools can already prove themselves to be a headache to get working properly, why am I seemingly making my life even harder?

The reason is quite simple: when things inevitably start going wrong, as they will, having an entirely open toolchain gives me more visibility into what is going on and the ability to fix it. I cannot delve into a black box.

Building OpenOCD
#

By default, the version of OpenOCD that I got on my server via the official package manager was outdated and missing features I will need.

Also, since having the ability to modify OpenOCD’s source code could come in handy, I decided to rebuild it from source.

Thus, in the following logs, I will be running OpenOCD version 0.12.0+dev-02170-gfcff4b712.

Note : I have also rebuilt the JLink libs from source.

Determining the scan chain
#

Since I do not have the schematics for the board, I do not know how many devices are daisy-chained on the board’s JTAG bus. Also, I want to confirm that the FPGA on the eBay listing is actually the one on the board. In JTAG, each chained device exposes an accessible IDCODE register used to identify the manufacturer, device type, and revision number.

When setting up the JTAG server, we typically define the scan chain by specifying the expected IDCODE for each TAP and the corresponding instruction register length, so that instructions can be correctly aligned and routed to the intended device. Given this is an undocumented board off eBay, I do not know what the chain looks like. Fortunately, OpenOCD has an autoprobing functionality that performs a blind interrogation in an attempt to discover the available devices.

Thus, my first order of business was doing this autoprobing.

In OpenOCD the autoprobing is done when the configuration does not specify any taps.

source [find interface/jlink.cfg]
transport select jtag

# Clock speed in kHz (OpenOCD reports "clock speed 1 kHz" in the log below)
set SPEED 1
jtag_rclk $SPEED
adapter speed $SPEED

reset_config none

The blind interrogation successfully discovered a single device on the chain with an IDCODE of 0x04a63093.

gp@workhorse:~/tools/openocd_jlink_test/autoprob$ openocd
Open On-Chip Debugger 0.12.0+dev-02170-gfcff4b712 (2025-09-04-21:02)
Licensed under GNU GPL v2
For bug reports, read
	http://openocd.org/doc/doxygen/bugs.html
none separate
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
Info : J-Link V10 compiled Jan 30 2023 11:28:07
Info : Hardware version: 10.10
Info : VTarget = 1.812 V
Info : clock speed 1 kHz
Warn : There are no enabled taps.  AUTO PROBING MIGHT NOT WORK!!
Info : JTAG tap: auto0.tap tap/device found: 0x04a63093 (mfg: 0x049 (Xilinx), part: 0x4a63, ver: 0x0)
Warn : AUTO auto0.tap - use "jtag newtap auto0 tap -irlen 2 -expected-id 0x04a63093"
Error: IR capture error at bit 2, saw 0x3ffffffffffffff5 not 0x...3
Warn : Bypassing JTAG setup events due to errors
Warn : gdb services need one or more targets defined
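
The fields OpenOCD prints next to the IDCODE follow the standard IEEE 1149.1 layout (4-bit version, 16-bit part number, 11-bit manufacturer ID, and a trailing bit that is always 1). As a quick sanity check, here is a minimal Python sketch that splits the value from the log above into those fields; the function name `decode_idcode` is purely illustrative.

```python
# Decode a 32-bit JTAG IDCODE into its IEEE 1149.1 fields.
def decode_idcode(idcode: int) -> dict:
    return {
        "version": (idcode >> 28) & 0xF,        # bits 31:28
        "part": (idcode >> 12) & 0xFFFF,        # bits 27:12
        "manufacturer": (idcode >> 1) & 0x7FF,  # bits 11:1
        "marker": idcode & 0x1,                 # bit 0, always 1 for an IDCODE
    }


# IDCODE reported by the autoprobe run
fields = decode_idcode(0x04A63093)
print({name: hex(value) for name, value in fields.items()})
```

The decoded manufacturer (0x049) and part (0x4a63) values should line up with what OpenOCD reports in the log.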

Comparing against the UltraScale Architecture Configuration User Guide (UG570) we see that this IDCODE matches up precisely with the expected value for the KU3P.

JTAG and IDCODE for UltraScale Architecture-based FPGAs

By default OpenOCD assumes a JTAG IR length of 2 bits, while our FPGA has an IR length of 6 bits. This is the cause behind the IR capture error encountered during autoprobing. By updating the script with an IR length of 6 bits we can re-detect the FPGA with no errors.

source [find interface/jlink.cfg]
transport select jtag

set SPEED 1
jtag_rclk $SPEED
adapter speed $SPEED

reset_config none

jtag newtap auto_detect tap -irlen 6

Output :

gp@workhorse:~/tools/openocd_jlink_test/autoprob$ openocd
Open On-Chip Debugger 0.12.0+dev-02170-gfcff4b712 (2025-09-04-21:02)
Licensed under GNU GPL v2
For bug reports, read
	http://openocd.org/doc/doxygen/bugs.html
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
Info : J-Link V10 compiled Jan 30 2023 11:28:07
Info : Hardware version: 10.10
Info : VTarget = 1.812 V
Info : clock speed 1 kHz
Info : JTAG tap: auto_detect.tap tap/device found: 0x04a63093 (mfg: 0x049 (Xilinx), part: 0x4a63, ver: 0x0)
Warn : gdb services need one or more targets defined

Based on the probing, this is the JTAG scan chain for our board :

JTAG scan chain for the alibaba cloud FPGA

System Monitor Registers
#

Previous generations of Xilinx FPGA had a system called the XADC that, among other features, allowed you to acquire chip temperature and voltage readings. The newer UltraScale and UltraScale+ families have replaced this XADC module with the SYSMON (and SYSMON4), which also provides these temperature and voltage readings, just better.

Unfortunately, OpenOCD didn’t have support for reading the SYSMON over JTAG out of the box, so I will be adding it.

To be more precise, the Kintex UltraScale+ has a SYSMON4 and not a SYSMON. For full context, there are 3 flavors of SYSMON:

  • SYSMON1 used in the Kintex and Virtex UltraScale series
  • SYSMON4 used in the Kintex, Virtex and in the Zynq programmable logic for the UltraScale+ series
  • SYSMON used in the processing system of the Zynq for the UltraScale+ series.
    Yes, you read that correctly: the Zynq of the UltraScale+ series features not one, but at least two distinct SYSMON instances.

For the purpose of this article, all these instances are similar enough that I will be using the terms SYSMON4 and SYSMON interchangeably.

In order for the JTAG to interact with the SYSMON, we first need to write the SYSMON_DRP command to the JTAG Instruction Register (IR). Based on the documentation, we see that this command has a value of 0x37, which funnily enough, is the same command code as the XADC, solidifying the SYSMON as the XADC’s descendant.

The SYSMON offers a lot more functionality than just reading voltage and temperature, but for today’s use case we will not be using any of that. Rather, we will focus only on reading a subset of the SYSMON status registers.

These status registers are located at addresses (00h-3Fh, 80h-BFh), and contain the measurement results of the analog-to-digital conversions, the flag registers, and the calibration coefficients. We can select which address we wish to read by writing the address to the Data Register (DR) over JTAG and the data will be read out of TDO.

# SPDX-License-Identifier: GPL-2.0-or-later

# Xilinx SYSMON4 support
#
# Based on UG580, used for UltraScale+ Xilinx FPGA
# This code implements access through the JTAG TAP.
#
# build a 32 bit DRP command for the SYSMON DRP
proc sysmon_cmd {cmd addr data} {
	array set cmds {
		NOP 0x00
		READ 0x01
		WRITE 0x02
	}
	return [expr {($cmds($cmd) << 26) | ($addr << 16) | ($data << 0)}]
}

# Status register addresses
# Some addresses (status registers 0-3) have special function when written to.
proc SYSMON {key} {
	array set addrs {
		TEMP 0x00
		VCCINT 0x01
		VCCAUX 0x02
		VPVN 0x03
		VREFP 0x04
		VREFN 0x05
		VCCBRAM 0x06
		SUPAOFFS 0x08
		ADCAOFFS 0x09
		ADCAGAIN 0x0a
		VCCPINTLP 0x0d
		VCCPINTFP 0x0e
		VCCPAUX 0x0f
		VAUX0 0x10
		VAUX1 0x11
		VAUX2 0x12
		VAUX3 0x13
		VAUX4 0x14
		VAUX5 0x15
		VAUX6 0x16
		VAUX7 0x17
		VAUX8 0x18
		VAUX9 0x19
		VAUX10 0x1a
		VAUX11 0x1b
		VAUX12 0x1c
		VAUX13 0x1d
		VAUX14 0x1e
		VAUX15 0x1f
		MAXTEMP 0x20
		MAXVCC 0x21
		MAXVCCAUX 0x22
	}
	return $addrs($key)
}

# transfer
proc sysmon_xfer {tap cmd addr data} {
	set ret [drscan $tap 32 [sysmon_cmd $cmd $addr $data]]
	runtest 10
	return [expr "0x$ret"]
}

# sysmon register write
proc sysmon_write {tap addr data} {
	sysmon_xfer $tap WRITE $addr $data
}

# sysmon register read, non-pipelined
proc sysmon_read {tap addr} {
	sysmon_xfer $tap READ $addr 0
	return [sysmon_xfer $tap NOP 0 0]
}


# Select the sysmon DR, SYSMON_DRP has the same binary code value as the XADC
proc sysmon_select {tap} {
	set SYSMON_IR 0x37
	irscan $tap $SYSMON_IR
	runtest 10
}

# convert 16 bit temperature measurement to Celsius
proc sysmon_temp_internal {code} {
	return [expr {$code * 509.314/(1 << 16) - 280.23}]
}

# convert 16 bit supply voltage measurements to Volt
proc sysmon_sup {code} {
	return [expr {$code * 3./(1 << 16)}]
}

# measure all internal voltages
proc sysmon_report {tap} {
	puts "Sysmon status report :"
	sysmon_select $tap
	foreach ch [list TEMP MAXTEMP] {
		echo "$ch [format %.2f [sysmon_temp_internal [sysmon_read $tap [SYSMON $ch]]]] C"
	}
	foreach ch [list VCCINT MAXVCC VCCAUX MAXVCCAUX] {
		echo "$ch [format %.3f [sysmon_sup [sysmon_read $tap [SYSMON $ch]]]] V"	
	}
}
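
To make the bit layout used by `sysmon_cmd` easier to see, here is a small Python mirror of that helper. It is only a sketch of the same packing (command shifted to bit 26, address to bit 16, data in the low 16 bits); the field masks are an assumption derived from those shift amounts, since the Tcl version does not mask its inputs.

```python
# Mirror of the Tcl sysmon_cmd helper: pack a 32-bit SYSMON DRP word.
DRP_CMDS = {"NOP": 0x0, "READ": 0x1, "WRITE": 0x2}


def sysmon_cmd(cmd: str, addr: int, data: int = 0) -> int:
    # Masks are an assumption based on the shift amounts (10-bit address,
    # 16-bit data); the original Tcl helper leaves the inputs unmasked.
    return (DRP_CMDS[cmd] << 26) | ((addr & 0x3FF) << 16) | (data & 0xFFFF)


# Words shifted into the DR to read TEMP (0x00) and MAXTEMP (0x20)
for name, addr in (("TEMP", 0x00), ("MAXTEMP", 0x20)):
    print(f"READ {name}: 0x{sysmon_cmd('READ', addr):08x}")
```

Because the read is not pipelined, the Tcl `sysmon_read` follows each READ word with a NOP word and takes the value shifted out during that second scan.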

I added a report to my flashing script output that reads the current chip temperature and the internal (VCCINT) and auxiliary (VCCAUX) supply voltages, as well as the maximum values recorded for each since the last FPGA power cycle:

gp@workhorse:~/tools/openocd_jlink_test$ openocd
Open On-Chip Debugger 0.12.0+dev-02170-gfcff4b712 (2025-09-04-20:02)
Licensed under GNU GPL v2
For bug reports, read
	http://openocd.org/doc/doxygen/bugs.html
set chipname XCKU3P
Read temperature sysmon 4
Info : J-Link V10 compiled Jan 30 2023 11:28:07
Info : Hardware version: 10.10
Info : VTarget = 1.819 V
Info : clock speed 1 kHz
Info : JTAG tap: XCKU3P.tap tap/device found: 0x04a63093 (mfg: 0x049 (Xilinx), part: 0x4a63, ver: 0x0)
Warn : gdb services need one or more targets defined
--------------------
Sysmon status report :
TEMP 31.12 C
MAXTEMP 34.62 C
VCCINT 0.852 V
MAXVCC 0.855 V
VCCAUX 1.805 V
MAXVCCAUX 1.807 V
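
To get a feel for the numbers in this report, the two conversion formulas from the Tcl script can be replayed offline. The Python sketch below reuses the same constants as `sysmon_temp_internal` and `sysmon_sup`; the inverse helper `temp_to_code` is a hypothetical addition, used only to work out roughly which raw code corresponds to a given temperature.

```python
# Same transfer functions as the Tcl sysmon_temp_internal / sysmon_sup helpers.
def sysmon_temp_internal(code: int) -> float:
    return code * 509.314 / (1 << 16) - 280.23


def sysmon_sup(code: int) -> float:
    return code * 3.0 / (1 << 16)


# Hypothetical inverse, only used to illustrate the scale of the raw codes.
def temp_to_code(celsius: float) -> int:
    return round((celsius + 280.23) * (1 << 16) / 509.314)


code = temp_to_code(31.12)
print(f"31.12 C is roughly raw code 0x{code:04x}")
print(f"converted back: {sysmon_temp_internal(code):.2f} C")
print(f"temperature resolution: {509.314 / (1 << 16):.4f} C per LSB")
print(f"mid-scale supply code 0x8000: {sysmon_sup(0x8000):.3f} V")
```

One LSB is worth well under a hundredth of a degree, so the two decimal places shown in the report are not limited by the conversion itself.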

Pinout
#

To my indescribable joy I happened to stumble onto this gold mine, in which we get the board pinout. This most likely fell off a truck: https://blog.csdn.net/qq_37650251/article/details/145716953

So far this pinout looks correct.

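| Pin Index | Name | IO Standard | Location | Bank |
|---|---|---|---|---|
| 0 | diff_100mhz_clk_p | LVDS | E18 | BANK67 |
| 1 | diff_100mhz_clk_n | LVDS | D18 | BANK67 |
| 2 | sfp_mgt_clk_p | LVDS | K7 | BANK227 |
| 3 | sfp_mgt_clk_n | LVDS | K6 | BANK227 |
| 4 | sfp_1_txn | - | B6 | BANK227 |
| 5 | sfp_1_txp | - | B7 | BANK227 |
| 6 | sfp_1_rxn | - | A3 | BANK227 |
| 7 | sfp_1_rxp | - | A4 | BANK227 |
| 8 | sfp_2_txn | - | D6 | BANK227 |
| 9 | sfp_2_txp | - | D7 | BANK227 |
| 10 | sfp_2_rxn | - | B1 | BANK227 |
| 11 | sfp_2_rxp | - | B2 | BANK227 |
| 12 | SFP_1_MOD_DEF_0 | LVCMOS18 | D14 | BANK87 |
| 13 | SFP_1_TX_FAULT | LVCMOS18 | B14 | BANK87 |
| 14 | SFP_1_LOS | LVCMOS18 | D13 | BANK87 |
| 15 | SFP_1_LED | LVCMOS18 | B12 | BANK87 |
| 16 | SFP_2_MOD_DEF_0 | LVCMOS18 | E11 | BANK86 |
| 17 | SFP_2_TX_FAULT | LVCMOS18 | F9 | BANK86 |
| 18 | SFP_2_LOS | LVCMOS18 | E10 | BANK86 |
| 19 | SFP_2_LED | LVCMOS18 | C12 | BANK87 |
| 20 | IIC_SDA_SFP_1 | LVCMOS18 | C14 | BANK87 |
| 21 | IIC_SCL_SFP_1 | LVCMOS18 | C13 | BANK87 |
| 22 | IIC_SDA_SFP_2 | LVCMOS18 | D11 | BANK86 |
| 23 | IIC_SCL_SFP_2 | LVCMOS18 | D10 | BANK86 |
| 24 | IIC_SDA_EEPROM_0 | LVCMOS18 | G10 | BANK86 |
| 25 | IIC_SCL_EEPROM_0 | LVCMOS18 | G9 | BANK86 |
| 26 | IIC_SDA_EEPROM_1 | LVCMOS18 | J15 | BANK87 |
| 27 | IIC_SCL_EEPROM_1 | LVCMOS18 | J14 | BANK87 |
| 28 | GPIO_LED_R | LVCMOS18 | A13 | BANK87 |
| 29 | GPIO_LED_G | LVCMOS18 | A12 | BANK87 |
| 30 | GPIO_LED_H | LVCMOS18 | B9 | BANK86 |
| 31 | GPIO_LED_1 | LVCMOS18 | B11 | BANK86 |
| 32 | GPIO_LED_2 | LVCMOS18 | C11 | BANK86 |
| 33 | GPIO_LED_3 | LVCMOS18 | A10 | BANK86 |
| 34 | GPIO_LED_4 | LVCMOS18 | B10 | BANK86 |
| 35 | pcie_mgt_clkn | - | T6 | BANK225 |
| 36 | pcie_mgt_clkp | - | T7 | BANK225 |
| 37 | pcie_tx0_n | - | R4 | BANK225 |
| 38 | pcie_tx1_n | - | U4 | BANK225 |
| 39 | pcie_tx2_n | - | W4 | BANK225 |
| 40 | pcie_tx3_n | - | AA4 | BANK225 |
| 41 | pcie_tx4_n | - | AC4 | BANK224 |
| 42 | pcie_tx5_n | - | AD6 | BANK224 |
| 43 | pcie_tx6_n | - | AE8 | BANK224 |
| 44 | pcie_tx7_n | - | AF6 | BANK224 |
| 45 | pcie_rx0_n | - | P1 | BANK225 |
| 46 | pcie_rx1_n | - | T1 | BANK225 |
| 47 | pcie_rx2_n | - | V1 | BANK225 |
| 48 | pcie_rx3_n | - | Y1 | BANK225 |
| 49 | pcie_rx4_n | - | AB1 | BANK224 |
| 50 | pcie_rx5_n | - | AD1 | BANK224 |
| 51 | pcie_rx6_n | - | AE3 | BANK224 |
| 52 | pcie_rx7_n | - | AF1 | BANK224 |
| 53 | pcie_tx0_p | - | R5 | BANK225 |
| 54 | pcie_tx1_p | - | U5 | BANK225 |
| 55 | pcie_tx2_p | - | W5 | BANK225 |
| 56 | pcie_tx3_p | - | AA5 | BANK225 |
| 57 | pcie_tx4_p | - | AC5 | BANK224 |
| 58 | pcie_tx5_p | - | AD7 | BANK224 |
| 59 | pcie_tx6_p | - | AE9 | BANK224 |
| 60 | pcie_tx7_p | - | AF7 | BANK224 |
| 61 | pcie_rx0_p | - | P2 | BANK225 |
| 62 | pcie_rx1_p | - | T2 | BANK225 |
| 63 | pcie_rx2_p | - | V2 | BANK225 |
| 64 | pcie_rx3_p | - | Y2 | BANK225 |
| 65 | pcie_rx4_p | - | AB2 | BANK224 |
| 66 | pcie_rx5_p | - | AD2 | BANK224 |
| 67 | pcie_rx6_p | - | AE4 | BANK224 |
| 68 | pcie_rx7_p | - | AF2 | BANK224 |
| 69 | pcie_perstn_rst | LVCMOS18 | A9 | BANK86 |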

Global clock
#

On high end FPGAs like the UltraScale+ family, high-speed global clocks are typically driven from external sources using differential pairs for better signal integrity.

According to the pinout we have two such differential pairs.

First I must determine the nature of these external reference clocks to see how I can use them to drive my clocks.

These differential pairs are provided over the following pins:

  • 100MHz : {E18, D18}
  • 156.25MHz : {K7, K6}

Judging by the naming and the frequencies, the 156.25MHz clock is likely my SFP reference clock, and the 100MHz can be used as my global clock.

We can confirm by querying the pin properties.

K6 properties :

Vivado% report_property [get_package_pins K6]
Property                Type    Read-only  Value
BANK                    string  true       227
BUFIO_2_REGION          string  true       TR
CLASS                   string  true       package_pin
DIFF_PAIR_PIN           string  true       K7
IS_BONDED               bool    true       1
IS_DIFFERENTIAL         bool    true       1
IS_GENERAL_PURPOSE      bool    true       0
IS_GLOBAL_CLK           bool    true       0
IS_LOW_CAP              bool    true       0
IS_MASTER               bool    true       0
IS_VREF                 bool    true       0
IS_VRN                  bool    true       0
IS_VRP                  bool    true       0
MAX_DELAY               int     true       38764
MIN_DELAY               int     true       38378
NAME                    string  true       K6
PIN_FUNC                enum    true       MGTREFCLK0N_227
PIN_FUNC_COUNT          int     true       1
PKGPIN_BYTEGROUP_INDEX  int     true       0
PKGPIN_NIBBLE_INDEX     int     true       0

E18 properties :

Vivado% report_property [get_package_pins E18]
Property                Type    Read-only  Value
BANK                    string  true       67
BUFIO_2_REGION          string  true       TL
CLASS                   string  true       package_pin
DIFF_PAIR_PIN           string  true       D18
IS_BONDED               bool    true       1
IS_DIFFERENTIAL         bool    true       1
IS_GENERAL_PURPOSE      bool    true       1
IS_GLOBAL_CLK           bool    true       1
IS_LOW_CAP              bool    true       0
IS_MASTER               bool    true       1
IS_VREF                 bool    true       0
IS_VRN                  bool    true       0
IS_VRP                  bool    true       0
MAX_DELAY               int     true       87126
MIN_DELAY               int     true       86259
NAME                    string  true       E18
PIN_FUNC                enum    true       IO_L11P_T1U_N8_GC_67
PIN_FUNC_COUNT          int     true       2
PKGPIN_BYTEGROUP_INDEX  int     true       8
PKGPIN_NIBBLE_INDEX     int     true       2

This tells us:

  • The differential pairings are correct: {K6, K7}, {E18, D18}
  • We can easily use the 100MHz as a source to drive our global clocking network
  • The 156.25MHz clock is to be used as the reference clock for our GTY transceivers and lands on bank 227 as indicated by the PIN_FUNC property MGTREFCLK0N_227
  • We cannot directly use the 156.25MHz clock to drive our global clock network

With all this we have sufficient information to write a constraint file (xdc) for this board.
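
Since the pinout is already tabular, the bulk of such a constraint file can be generated rather than typed by hand. The Python sketch below turns a few pinout rows into `set_property` lines in the same `LOC`/`IOSTANDARD` style used later in this article. The `PINOUT` subset and the choice to reuse the pinout net names as top-level port names are assumptions made for illustration, and rows without an IO standard (the transceiver pins) are simply skipped.

```python
# A few (name, IO standard, location) rows copied from the pinout table.
PINOUT = [
    ("GPIO_LED_1", "LVCMOS18", "B11"),
    ("GPIO_LED_2", "LVCMOS18", "C11"),
    ("pcie_perstn_rst", "LVCMOS18", "A9"),
    ("sfp_1_txp", "-", "B7"),  # transceiver pin, no IO standard listed
]


def xdc_line(port: str, iostandard: str, loc: str) -> str:
    # Doubled braces are literal braces in an f-string.
    return (
        f"set_property -dict {{LOC {loc} IOSTANDARD {iostandard}}} "
        f"[get_ports {{{port}}}]"
    )


def generate_xdc(rows) -> list:
    # Skip rows whose IO standard column is "-".
    return [xdc_line(n, std, loc) for n, std, loc in rows if std != "-"]


print("\n".join(generate_xdc(PINOUT)))
```

The generated port names only make sense if the top-level module uses the same names, so in practice the first column would be mapped to the design's own ports.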

Test design
#

Further sections will be using the following design files.

top.v:

module top (
    input wire Clk_100mhz_p_i, 
    input wire Clk_100mhz_n_i,

    output wire [3:0] Led_o 
);
    wire        clk_ibuf;
    wire        clk;          // global clock, driven by the BUFG below
    reg  [28:0] ctr_q;
    reg         unused_ctr_q;


    IBUFDS #(
        .DIFF_TERM("TRUE"),
        .IOSTANDARD("LVDS")
    ) m_ibufds (
        .I(Clk_100mhz_p_i),
        .IB(Clk_100mhz_n_i),
        .O(clk_ibuf)
    );

    BUFG m_bufg (
        .I(clk_ibuf),
        .O(clk)
    );

    always @(posedge clk)
        { unused_ctr_q, ctr_q } <= ctr_q + 29'b1;    
    
    assign Led_o = ctr_q[28:25];
endmodule

alibaba_cloud.xdc :

# Global clock signal 
set_property -dict {LOC E18 IOSTANDARD LVDS} [get_ports Clk_100mhz_p_i]
set_property -dict {LOC D18 IOSTANDARD LVDS} [get_ports Clk_100mhz_n_i]
create_clock -period 10 -name clk_100mhz [get_ports Clk_100mhz_p_i]

# LEDS
set_property -dict {LOC B11 IOSTANDARD LVCMOS18} [get_ports { Led_o[0]}]
set_property -dict {LOC C11 IOSTANDARD LVCMOS18} [get_ports { Led_o[1]}]
set_property -dict {LOC A10 IOSTANDARD LVCMOS18} [get_ports { Led_o[2]}]
set_property -dict {LOC B10 IOSTANDARD LVCMOS18} [get_ports { Led_o[3]}]
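
As a quick check of what to expect on the board, the blink rate of each LED follows directly from the counter bit it is wired to: bit n of a free-running counter completes one full on/off cycle every 2^(n+1) clock cycles. The Python sketch below works this out for `ctr_q[28:25]`, assuming the clock really is 100MHz as the pinout suggests.

```python
CLK_HZ = 100_000_000  # assumed frequency of the E18/D18 differential clock


def blink_period_s(bit: int, clk_hz: int = CLK_HZ) -> float:
    # Counter bit n has a full period of 2^(n+1) clock cycles.
    return (1 << (bit + 1)) / clk_hz


# Led_o[3:0] is wired to ctr_q[28:25]
for led, bit in enumerate(range(25, 29)):
    print(f"Led_o[{led}] <- ctr_q[{bit}]: one blink every {blink_period_s(bit):.3f} s")
```

If the LEDs blink noticeably faster or slower than this, the clock assumption is the first thing to revisit.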

Writing the bitstream
#

My personal belief is that one of the most important contributors to design quality is iteration cost. The lower your iteration cost, the higher your design quality is going to be.

As such, I will invest the small upfront cost needed to make the workflow as streamlined as is reasonably feasible.

Thus, my workflow evolved into doing practically everything over the command line interfaces and only interacting with the tools, Vivado in this case, through tcl scripts.

Vivado flow
#

The goal of this flow is to produce an SVF file from a few Verilog design and constraint files. Our steps are :

  1. create the Vivado project: setup.tcl
  2. run the implementation: build.tcl
  3. generate the bitstream and the SVF: gen.tcl

I will be using make to kick off and manage the dependencies between the different steps, though I recognise this isn’t a widespread practice for hardware projects. make is a highly flexible, reliable and powerful tool and I believe its ability to tie together any type of workflow makes it a prime tool for this use case.

We will be invoking Vivado in batch mode, which allows us to provide a tcl script alongside script arguments. The format is as follows :

vivado -mode batch -source <path to tcl script> -tclargs <script args>

Though this allows us to easily break down our flow into incremental stages, invoking a single script in batch mode has the drawback of restarting Vivado and needing to re-load the project or the project checkpoint on each invocation.

As the project size grows so will the project load time, so segmenting the flow into a large number of independent scripts comes at an increasing cost.

Makefile :

SHELL := /bin/bash

VIVADO_PRJ_DIR=prj
VIVADO_PRJ_NAME=$(VIVADO_PRJ_DIR)
VIVADO_PRJ_PATH=$(VIVADO_PRJ_DIR)/$(VIVADO_PRJ_NAME).xpr
VIVADO_CHECKPOINT_PATH=$(VIVADO_PRJ_DIR)/$(VIVADO_PRJ_NAME)_checkpoint.dcp

VIVADO_CMD=vivado -mode batch -source

SRC_PATH=src
OUT_DIR=out


# These targets do not produce files of the same name
.PHONY: all setup build gen flash clean

all: setup build gen

# Recipe lines must be indented with a tab character
$(VIVADO_PRJ_PATH):
	mkdir -p $(VIVADO_PRJ_DIR)
	$(VIVADO_CMD) setup.tcl -tclargs $(VIVADO_PRJ_DIR) $(VIVADO_PRJ_NAME)

setup: $(VIVADO_PRJ_PATH)

$(VIVADO_CHECKPOINT_PATH): $(VIVADO_PRJ_PATH) $(wildcard $(SRC_PATH)/*.xdc) $(wildcard $(SRC_PATH)/*.v)
	$(VIVADO_CMD) build.tcl -tclargs $(VIVADO_PRJ_PATH) $(SRC_PATH) $(VIVADO_CHECKPOINT_PATH)

build: $(VIVADO_CHECKPOINT_PATH)

$(OUT_DIR)/$(VIVADO_PRJ_NAME).svf: $(VIVADO_CHECKPOINT_PATH)
	mkdir -p $(OUT_DIR)
	$(VIVADO_CMD) gen.tcl -tclargs $(VIVADO_CHECKPOINT_PATH) $(OUT_DIR)

gen: $(OUT_DIR)/$(VIVADO_PRJ_NAME).svf

flash: $(OUT_DIR)/$(VIVADO_PRJ_NAME).svf
	openocd

clean:
	rm -rf $(VIVADO_PRJ_DIR)
	rm -rf $(OUT_DIR)
	rm -f vivado*{log,jou}
	rm -f webtalk*{log,jou}
	rm -f usage_statistics_webtalk*{html,xml}

setup.tcl :

set project_dir [lindex $argv 0]
set project_name [lindex $argv 1]

puts "Creating project $project_name at path [pwd]/$project_dir"
create_project -part xcku3p-ffvb676-2-e -force $project_name $project_dir

close_project
exit 0

build.tcl :

set project_path [lindex $argv 0]
set src_path [lindex $argv 1]
set checkpoint_path [lindex $argv 2]
puts "Implementation script called with project path $project_path and src path $src_path, generating checkpoint at $checkpoint_path"

open_project $project_path 

# load src
read_verilog [glob -directory $src_path *.v]
read_xdc [glob -directory $src_path *.xdc]


# synth
synth_design -top top

# implement
opt_design
place_design
route_design
phys_opt_design

write_checkpoint $checkpoint_path -force 
close_project
exit 0

Generating the SVF file
#

SVF, short for Serial Vector Format, is a human-readable, vendor-agnostic format used to specify JTAG bus operations.

Example SVF file, test program:

! Initialize UUT
STATE RESET;
! End IR scans in DRPAUSE
ENDIR DRPAUSE;
! End DR scans in DRPAUSE
ENDDR DRPAUSE;
! 24 bit IR header
HIR 24 TDI (FFFFFF);
! 3 bit DR header
HDR 3 TDI (7);
! 16 bit IR trailer
TIR 16 TDI (FFFF);
! 2 bit DR trailer
TDR 2 TDI (3);
! 8 bit IR scan, load BIST opcode
SIR 8 TDI (41) TDO (81) MASK (FF);
! 16 bit DR scan, load BIST seed
SDR 16 TDI (ABCD);
! RUNBIST for 95 TCK Clocks
RUNTEST 95 TCK ENDSTATE IRPAUSE;
! 16 bit DR scan, check BIST status
SDR 16 TDI (0000) TDO(1234) MASK(FFFF);
! Enter Test-Logic-Reset
STATE RESET;
! End Test Program
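
Because the format is plain text, it is easy to inspect with a few lines of scripting. The Python sketch below splits an SVF snippet into its statements: comments (introduced by `!` or `//`) are dropped and statements are delimited by `;`. It is only meant to illustrate the structure of the file, not to be a complete SVF parser, and the names `svf_statements` and `SAMPLE` are made up for this example.

```python
SAMPLE = """\
! Initialize UUT
STATE RESET;
! 8 bit IR scan, load BIST opcode
SIR 8 TDI (41) TDO (81) MASK (FF);
! 16 bit DR scan, load BIST seed
SDR 16 TDI (ABCD);
"""


def svf_statements(text: str) -> list:
    stripped = []
    for line in text.splitlines():
        # '!' and '//' start a comment that runs to the end of the line
        for marker in ("!", "//"):
            line = line.split(marker, 1)[0]
        stripped.append(line)
    # Statements may span several lines, so join before splitting on ';'
    joined = " ".join(stripped)
    return [" ".join(stmt.split()) for stmt in joined.split(";") if stmt.strip()]


for statement in svf_statements(SAMPLE):
    print(statement)
```

A real bitstream SVF spreads its long hexadecimal `TDI` payload over many lines, so a more complete tool would also need to stitch those hex fragments back together.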

Vivado can generate a hardware aware SVF file containing the configuration sequence for an FPGA board, allowing us to write a bitstream.

Given that the SVF file literally contains the bitstream written out in plain hexadecimal, our first step is to generate our design’s bitstream.

Vivado proper isn’t the software that generates the SVF file, this task is done by the hardware manager which handles all of the configuration.

We can launch a new instance with open_hw_manager and connect to it with connect_hw_server. Since JTAG is a daisy-chained bus, and given the SVF file is just a standardised way of specifying JTAG bus operations, in order to generate a correct JTAG configuration sequence, we must inform the hardware manager of our scan chain.

During our earlier probing of the scan chain, we established that our FPGA is the only device on the chain. We inform the hardware manager of this by creating a new device configuration ( the term “device” refers to the “board” here ) and adding our FPGA to the chain using create_hw_device -part <device name>. When we have multiple devices, we should register them in the order in which they appear on the chain.

Finally to generate the SVF file, we must select the device we wish to program with program_hw_device <hw_device>, then write out the SVF to the file using write_hw_svf <path to svf file>.

gen.tcl:

set checkpoint_path [lindex $argv 0]
set out_dir [lindex $argv 1]
puts "SVF generation script called with checkpoint path $checkpoint_path, generating to $out_dir"

open_checkpoint $checkpoint_path

# defines
set hw_target "alibaba_board_svf_target"
set fpga_device "xcku3p"
set bin_path "$out_dir/[current_project]"

write_bitstream "$bin_path.bit" -force

open_hw_manager

# connect to hw server with default config
connect_hw_server
puts "connected to hw server at [current_hw_server]"

create_hw_target $hw_target
puts "current hw target [current_hw_target]"

open_hw_target

# single device on scan chain
create_hw_device -part $fpga_device
puts "scan chain : [get_hw_devices]"

set_property PROGRAM.FILE "$bin_path.bit" [get_hw_devices]

# select device to program
program_hw_device [get_hw_devices]

# generate svf file
write_hw_svf -force "$bin_path.svf"

close_hw_manager
exit 0

Configuring the FPGA using OpenOCD
#

Although it is not widely known, OpenOCD has a very nice SVF execution command :

18.1 SVF: Serial Vector Format
#

The Serial Vector Format, better known as SVF, is a way to represent JTAG test patterns in text files. In a debug session using JTAG for its transport protocol, OpenOCD supports running such test files.

[Command] svf filename [-tap tapname] [[-]quiet] [[-]nil] [[-]progress] [[-]ignore_error]

This issues a JTAG reset (Test-Logic-Reset) and then runs the SVF script from filename. Arguments can be specified in any order; the optional dash doesn’t affect their semantics.

Command options:

  • -tap tapname ignore IR and DR headers and footers specified by the SVF file with HIR, TIR, HDR and TDR commands; instead, calculate them automatically according to the current JTAG chain configuration, targeting tapname;
  • [-]quiet do not log every command before execution;
  • [-]nil “dry run”, i.e., do not perform any operations on the real interface;
  • [-]progress enable progress indication;
  • [-]ignore_error continue execution despite TDO check errors.

We invoke it in our OpenOCD script using the -progress option for additional logging.

openocd :

set svf_path "out/project_prj_checkpoint.svf"

source [find interface/jlink.cfg]
transport select jtag

set SPEED 1
jtag_rclk $SPEED
adapter speed $SPEED 
reset_config none

# jlink config

set CHIPNAME XCKU3P
set CHIP $CHIPNAME
puts "set chipname $CHIP"

source [find ../openocd/tcl/cpld/xilinx-xcu.cfg]

source [find ../openocd/tcl/fpga/xilinx-sysmon.cfg]

init 

puts "--------------------"

sysmon_report $CHIP.tap

puts "--------------------"

# program
if {![file exists $svf_path]} {
    puts "Svf path not found : $svf_path"
    exit
}

svf $svf_path -progress 
 
exit 

Flashing sequence log :

gp@workhorse:~/tools/openocd_jlink_test$ openocd
Open On-Chip Debugger 0.12.0+dev-02170-gfcff4b712 (2025-09-04-21:02)
Licensed under GNU GPL v2
For bug reports, read
	http://openocd.org/doc/doxygen/bugs.html
set chipname XCKU3P
Read temperature sysmon 4
Info : J-Link V10 compiled Jan 30 2023 11:28:07
Info : Hardware version: 10.10
Info : VTarget = 1.812 V
Info : clock speed 1 kHz
Info : JTAG tap: XCKU3P.tap tap/device found: 0x04a63093 (mfg: 0x049 (Xilinx), part: 0x4a63, ver: 0x0)
Warn : gdb services need one or more targets defined
--------------------
Sysmon status report :
TEMP 50.46 C
MAXTEMP 52.79 C
VCCINT 0.846 V
MAXVCC 0.860 V
VCCAUX 1.799 V
MAXVCCAUX 1.809 V
--------------------
svf processing file: "out/project_prj_checkpoint.svf"
  0%  TRST OFF;
  0%  ENDIR IDLE;
  0%  ENDDR IDLE;
  0%  STATE RESET;
  0%  STATE IDLE;
  0%  FREQUENCY 1.00E+07 HZ;
adapter speed: 10000 kHz
  0%  HIR 0 ;
  0%  TIR 0 ;
  0%  HDR 0 ;
  0%  TDR 0 ;
  0%  SIR 6 TDI (09) ;
  0%  SDR 32 TDI (00000000) TDO (04a63093) MASK (0fffffff) ;
  0%  STATE RESET;
  0%  STATE IDLE;
  0%  SIR 6 TDI (0b) ;
  0%  SIR 6 TDI (14) ;
  0%  RUNTEST 0.100000 SEC;
  0%  RUNTEST 10000 TCK;
  0%  SIR 6 TDI (14) TDO (11) MASK (31) ;
  0%  SIR 6 TDI (05) ;
 95%  ffffffffffff) ;
 95%  SIR 6 TDI (09) TDO (31) MASK (11) ;
 95%  STATE RESET;
 95%  RUNTEST 5 TCK;
 95%  SIR 6 TDI (05) ;
 95%  SDR 160 TDI (0000000400000004800700140000000466aa9955) ;
 95%  SIR 6 TDI (04) ;
 95%  SDR 32 TDI (00000000) TDO (3f5e0d40) MASK (08000000) ;
 95%  STATE RESET;
 95%  RUNTEST 5 TCK;
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
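
The `TDO (...) MASK (...)` pairs in this log are how the SVF file verifies what the FPGA shifts back: only the bits set in the mask are compared against the expected value. The Python sketch below illustrates that comparison using the two checks visible in the log; the "actual" values fed to it are made-up examples, not captures from the board.

```python
def tdo_matches(actual: int, expected: int, mask: int) -> bool:
    # Only bits set to 1 in the mask take part in the comparison.
    return (actual & mask) == (expected & mask)


# IDCODE check from the log: the mask 0x0fffffff ignores the 4-bit version
# field, so a hypothetical later silicon revision would still pass.
print(tdo_matches(0x14A63093, 0x04A63093, 0x0FFFFFFF))

# Final check from the log: a single bit (mask 0x08000000) is inspected.
print(tdo_matches(0x08000000, 0x3F5E0D40, 0x08000000))
print(tdo_matches(0x00000000, 0x3F5E0D40, 0x08000000))
```

Without the `-ignore_error` option, a failed comparison of this kind is what would abort the SVF run.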

This results in a successfully configured FPGA.

Conclusion
#

For $200 we got a fully working decommissioned Alibaba Cloud accelerator featuring a Kintex UltraScale+ FPGA with an easily accessible debugging/programming interface and enough pinout information to define our own constraint files.

We also have a fully automated Vivado workflow to implement our designs and the ability to write the bitstream, and interface with the FPGA’s internal JTAG accessible registers using an open source programming tool without the need for an official Xilinx programmer.

In the end, this project delivered a roughly 5x cost saving over commercial boards (compared to the lowest-cost $900-1050 Alinx alternatives), making this perhaps the most cost-effective entry point for a Kintex UltraScale+ board.

External resources
#

Xilinx Vivado Supported Devices : https://docs.amd.com/r/en-US/ug973-vivado-release-notes-install-license/Supported-Devices

Official Xilinx dev board : https://www.amd.com/en/products/adaptive-socs-and-fpgas/evaluation-boards/ek-u1-kcu116-g.html

Alinx Kintex UltraScale+ dev boards : https://www.en.alinx.com/Product/FPGA-Development-Boards/Kintex-UltraScale-plus.html

UltraScale Architecture Configuration User Guide (UG570) : https://docs.amd.com/r/en-US/ug570-ultrascale-configuration/Device-Resources-and-Configuration-Bitstream-Lengths?section=gyn1703168518425__table_vyh_4hs_szb

UltraScale Architecture System Monitor User Guide (UG580): https://docs.amd.com/v/u/en-US/ug580-ultrascale-sysmon

Vivado Design Suite Tcl Command Reference Guide (UG835): https://docs.amd.com/r/en-US/ug835-vivado-tcl-commands/Tcl-Initialization-Scripts

PCI vendor/device ID database: https://admin.pci-ids.ucw.cz/read/PC/14e4

PCI device classes: https://admin.pci-ids.ucw.cz/read/PD

Linux kernel PCI IDs: https://github.com/torvalds/linux/blob/7aac71907bdea16e2754a782b9d9155449a9d49d/include/linux/pci_ids.h#L160-L3256

Linux kernel PCI classes: https://github.com/torvalds/linux/blob/7aac71907bdea16e2754a782b9d9155449a9d49d/include/linux/pci_ids.h#L15-L158

Truck-kun pinout: https://blog.csdn.net/qq_37650251/article/details/145716953

Ebay listing: https://www.ebay.com/itm/167626831054?_trksid=p4375194.c101800.m5481

OpenOCD documentation: https://openocd.org/doc-release/pdf/openocd.pdf