We have a Dell C410x PCI-e expansion chassis fitted with 3 x M2090 GPU cards and 2 x K20 cards. The host server is a C6100 fitted with two blades, each with iPASS connections to the C410x, so that the M2090 cards connect to one blade and the K20 cards connect to the other blade. All of this is working fine with Ubuntu 14.04 installed on the host servers.
Recently we purchased three more K20 GPU cards from Dell and fitted them alongside the existing two K20 cards. Although all 5 cards are listed when running the lspci utility, the nVidia driver does not load. After some experimentation I found that the kernel cannot allocate memory for the PCI-e BARs (base address registers) if more than 2 GPUs are installed. For the nVidia K20 cards, lspci -vv reports things like:
Region 0: Memory at c1000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at <ignored> (64-bit, prefetchable)
Region 3: Memory at <ignored> (64-bit, prefetchable)
and there are lots of errors in dmesg like:
[ 0.573153] pci 0000:04:00.0: BAR 13: can't assign io (size 0xe000)
[ 0.573236] pci 0000:05:08.0: BAR 14: can't assign mem (size 0x200000)
[ 0.573320] pci 0000:05:08.0: BAR 15: can't assign mem pref (size 0x200000)
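To get a rough idea of how much address space the kernel is failing to assign, the "can't assign" lines could be totalled per resource type with a short script. This is only a sketch: the function name summarize_bar_failures and the SAMPLE text are made up for illustration, and the pattern only matches the message format pasted above (other kernel versions may word these errors differently).

```python
import re
from collections import defaultdict

# Matches kernel lines in the format shown above, for example:
#   pci 0000:05:08.0: BAR 15: can't assign mem pref (size 0x200000)
BAR_FAIL = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): can't assign (?P<kind>io|mem pref|mem) "
    r"\(size (?P<size>0x[0-9a-f]+)\)"
)


def summarize_bar_failures(dmesg_text):
    """Return {resource kind: total bytes the kernel could not assign}."""
    totals = defaultdict(int)
    for match in BAR_FAIL.finditer(dmesg_text):
        totals[match.group("kind")] += int(match.group("size"), 16)
    return dict(totals)


# Sample input copied from the dmesg excerpt above.
SAMPLE = """\
[ 0.573153] pci 0000:04:00.0: BAR 13: can't assign io (size 0xe000)
[ 0.573236] pci 0000:05:08.0: BAR 14: can't assign mem (size 0x200000)
[ 0.573320] pci 0000:05:08.0: BAR 15: can't assign mem pref (size 0x200000)
"""

for kind, total in sorted(summarize_bar_failures(SAMPLE).items()):
    print(f"{kind}: {total:#x} bytes unassigned")
```

In practice the full output of dmesg would be passed in instead of SAMPLE. A large "mem pref" total would suggest that the 64-bit prefetchable windows are the ones running out of space.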
The nVidia driver reports
[ 17.028050] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 17.028050] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:14:00.0)
[ 17.028052] NVRM: The system BIOS may have misconfigured your GPU.
[ 17.028056] nvidia: probe of 0000:14:00.0 failed with error -1
[ 17.028241] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 17.028243] NVRM: None of the NVIDIA graphics adapters were initialized!
[ 17.028244] [drm] Module unloaded
[ 17.028314] NVRM: NVIDIA init module failed!
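When swapping cards in and out, it can help to pull out just the PCI addresses the driver complains about. The following is a minimal sketch that assumes the NVRM message format shown above; the function name unassigned_gpu_bars and the SAMPLE text are hypothetical and not part of any nVidia tool.

```python
import re

# Matches driver lines in the format shown above, for example:
#   NVRM: BAR1 is 0M @ 0x0 (PCI:0000:14:00.0)
NVRM_BAR = re.compile(
    r"NVRM: BAR(?P<bar>\d+) is (?P<size>\d+)M @ (?P<addr>0x[0-9a-fA-F]+) "
    r"\(PCI:(?P<dev>[0-9a-fA-F:.]+)\)"
)


def unassigned_gpu_bars(dmesg_text):
    """Return (pci_address, bar_number) pairs reported as 0M at address 0x0."""
    return [
        (match.group("dev"), int(match.group("bar")))
        for match in NVRM_BAR.finditer(dmesg_text)
        if int(match.group("size")) == 0 and int(match.group("addr"), 16) == 0
    ]


# Sample input copied from the driver messages above.
SAMPLE = """\
[ 17.028050] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 17.028050] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:14:00.0)
[ 17.028052] NVRM: The system BIOS may have misconfigured your GPU.
"""

for device, bar in unassigned_gpu_bars(SAMPLE):
    print(f"{device}: BAR{bar} was not assigned an address")
```

Comparing the addresses returned here with the slots the new cards occupy would show whether the failures follow particular cards or particular slots.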
The BIOS on the C6100 blade was version 1.66 and I have updated this to the latest 1.71 version but the problem remains.
There is another interesting thing: the original K20 GPU cards we bought last year have serial nos ending in 0009 and 0010, while the new ones we have just received have serial nos ending in 0012, 0058 and 0064.
Now, the card with serial # 0012 works on its own or with either of the existing cards fitted, but cards 0058 and 0064 do not work even if fitted on their own. With those two we see the same error messages from dmesg and lspci as we get when fitting more than 2 GPU cards. I'm guessing these are later K20 cards that are different from the earlier ones?
It seems there is a limit to the number of K20s that can be used with the C6100 and/or compatibility issues with later cards. Before I return these GPUs to Dell, does anyone know of a solution?
Thanks,
Andy