Vulnerability-Lookup

PYSEC-2026-145

Vulnerability from pysec - Published: 2026-05-12 20:16 - Updated: 2026-05-20 09:19

Details

vLLM is an inference and serving engine for large language models (LLMs). From to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0.

Severity

6.5 (Medium)


                  
                    CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Impacted products

Name	purl
vllm	pkg:pypi/vllm

Aliases

JSON

To clipboard

{
  "affected": [
    {
      "package": {
        "ecosystem": "PyPI",
        "name": "vllm",
        "purl": "pkg:pypi/vllm"
      },
      "ranges": [
        {
          "events": [
            {
              "introduced": "0.18.0"
            },
            {
              "fixed": "0.20.0"
            }
          ],
          "type": "ECOSYSTEM"
        }
      ],
      "versions": [
        "0.18.0",
        "0.18.1",
        "0.19.0",
        "0.19.1"
      ]
    }
  ],
  "aliases": [
    "CVE-2026-44223",
    "GHSA-83vm-p52w-f9pw"
  ],
  "details": "vLLM is an inference and serving engine for large language models (LLMs). From  to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., \"repetition_penalty\": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0.",
  "id": "PYSEC-2026-145",
  "modified": "2026-05-20T09:19:21.596358Z",
  "published": "2026-05-12T20:16:43.293Z",
  "references": [
    {
      "type": "ADVISORY",
      "url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw"
    },
    {
      "type": "FIX",
      "url": "https://github.com/vllm-project/vllm/pull/38610"
    }
  ],
  "severity": [
    {
      "score": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
      "type": "CVSS_V3"
    }
  ]
}

CVE-2026-44223 (GCVE-0-2026-44223)

Vulnerability from cvelistv5 – Published: 2026-05-12 19:58 – Updated: 2026-06-22 21:49

Title

vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters

Summary

vLLM is an inference and serving engine for large language models (LLMs). From 0.18.0 to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0.

Severity

6.5 (Medium)


                        
                          CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

SSVC

Exploitation: poc Automatable: no Technical Impact: partial

CISA Coordinator (v2.0.3)

CWE

CWE-131 - Incorrect Calculation of Buffer Size
CWE-704 - Incorrect Type Conversion or Cast

Assigner

GitHub_M

References

2 references

URL	Tags
https://github.com/vllm-project/vllm/security/adv…	x_refsource_CONFIRM
https://github.com/vllm-project/vllm/pull/38610	x_refsource_MISC

Impacted products

1 product

Vendor	Product	Version
vllm-project	vllm	Affected: >= 0.18.0, < 0.20.0

Show details on NVD website

JSON

To clipboard

{
  "containers": {
    "adp": [
      {
        "metrics": [
          {
            "other": {
              "content": {
                "id": "CVE-2026-44223",
                "options": [
                  {
                    "Exploitation": "poc"
                  },
                  {
                    "Automatable": "no"
                  },
                  {
                    "Technical Impact": "partial"
                  }
                ],
                "role": "CISA Coordinator",
                "timestamp": "2026-05-15T14:44:05.012494Z",
                "version": "2.0.3"
              },
              "type": "ssvc"
            }
          }
        ],
        "providerMetadata": {
          "dateUpdated": "2026-05-15T14:46:25.695Z",
          "orgId": "134c704f-9b21-4f2e-91b3-4a467353bcc0",
          "shortName": "CISA-ADP"
        },
        "references": [
          {
            "tags": [
              "exploit"
            ],
            "url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw"
          },
          {
            "tags": [
              "exploit"
            ],
            "url": "https://github.com/vllm-project/vllm/pull/38610"
          }
        ],
        "title": "CISA ADP Vulnrichment"
      }
    ],
    "cna": {
      "affected": [
        {
          "product": "vllm",
          "vendor": "vllm-project",
          "versions": [
            {
              "status": "affected",
              "version": "\u003e= 0.18.0, \u003c 0.20.0"
            }
          ]
        }
      ],
      "descriptions": [
        {
          "lang": "en",
          "value": "vLLM is an inference and serving engine for large language models (LLMs). From 0.18.0 to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., \"repetition_penalty\": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0."
        }
      ],
      "metrics": [
        {
          "cvssV3_1": {
            "attackComplexity": "LOW",
            "attackVector": "NETWORK",
            "availabilityImpact": "HIGH",
            "baseScore": 6.5,
            "baseSeverity": "MEDIUM",
            "confidentialityImpact": "NONE",
            "integrityImpact": "NONE",
            "privilegesRequired": "LOW",
            "scope": "UNCHANGED",
            "userInteraction": "NONE",
            "vectorString": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
            "version": "3.1"
          }
        }
      ],
      "problemTypes": [
        {
          "descriptions": [
            {
              "cweId": "CWE-131",
              "description": "CWE-131: Incorrect Calculation of Buffer Size",
              "lang": "en",
              "type": "CWE"
            }
          ]
        },
        {
          "descriptions": [
            {
              "cweId": "CWE-704",
              "description": "CWE-704: Incorrect Type Conversion or Cast",
              "lang": "en",
              "type": "CWE"
            }
          ]
        }
      ],
      "providerMetadata": {
        "dateUpdated": "2026-06-22T21:49:24.277Z",
        "orgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
        "shortName": "GitHub_M"
      },
      "references": [
        {
          "name": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw",
          "tags": [
            "x_refsource_CONFIRM"
          ],
          "url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw"
        },
        {
          "name": "https://github.com/vllm-project/vllm/pull/38610",
          "tags": [
            "x_refsource_MISC"
          ],
          "url": "https://github.com/vllm-project/vllm/pull/38610"
        }
      ],
      "source": {
        "advisory": "GHSA-83vm-p52w-f9pw",
        "discovery": "UNKNOWN"
      },
      "title": "vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters"
    }
  },
  "cveMetadata": {
    "assignerOrgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
    "assignerShortName": "GitHub_M",
    "cveId": "CVE-2026-44223",
    "datePublished": "2026-05-12T19:58:40.862Z",
    "dateReserved": "2026-05-05T15:42:40.518Z",
    "dateUpdated": "2026-06-22T21:49:24.277Z",
    "state": "PUBLISHED"
  },
  "dataType": "CVE_RECORD",
  "dataVersion": "5.2"
}

GHSA-83VM-P52W-F9PW

Vulnerability from github – Published: 2026-05-06 21:45 – Updated: 2026-06-08 19:52

Summary

vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters

Details

Summary

The extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty).

A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. The crash is deterministic and immediate — no concurrency, race condition, or special workload is required.

Details

In vLLM v0.17.0, the extract_hidden_states proposer's propose() method returned sampled_token_ids.unsqueeze(-1), producing a tensor of shape (batch_size, 1).

In PR #37013 (first released in v0.18.0), the KV connector interface was refactored out of propose(). The return type changed from tuple[Tensor, KVConnectorOutput | None] to Tensor, and the .unsqueeze(-1) call was removed along with the KV connector output:

# Before (v0.17.0):
return sampled_token_ids.unsqueeze(-1), kv_connector_output  # shape (batch_size, 1)

# After (v0.18.0+):
return sampled_token_ids  # shape (batch_size, 2) after first decode step

The refactor missed that sampled_token_ids changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as (batch_size, max_spec_len + 1). With num_speculative_tokens=1, this produces shape (batch_size, 2) instead of the expected (batch_size, 1), causing a broadcast shape mismatch during penalty application.

Impact

Any vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with extract_hidden_states speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.

Patches

Fixed in PR #38610, first included in vLLM v0.20.0. The fix slices the return value to sampled_token_ids[:, :1], ensuring the correct (batch_size, 1) shape regardless of the rejection sampler's output dimensions.

Workarounds

Upgrade to vLLM v0.20.0 or later.
If upgrading is not possible, avoid using extract_hidden_states as the speculative decoding method on affected versions.
Alternatively, reject or strip penalty parameters (repetition_penalty, frequency_penalty, presence_penalty) from incoming requests at an API gateway before they reach vLLM.

Severity

6.5 (Medium)


                  
                    CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Show details on source website

JSON

To clipboard

{
  "affected": [
    {
      "package": {
        "ecosystem": "PyPI",
        "name": "vllm"
      },
      "ranges": [
        {
          "events": [
            {
              "introduced": "0.18.0"
            },
            {
              "fixed": "0.20.0"
            }
          ],
          "type": "ECOSYSTEM"
        }
      ]
    }
  ],
  "aliases": [
    "CVE-2026-44223"
  ],
  "database_specific": {
    "cwe_ids": [
      "CWE-131",
      "CWE-704"
    ],
    "github_reviewed": true,
    "github_reviewed_at": "2026-05-06T21:45:51Z",
    "nvd_published_at": "2026-05-12T20:16:43Z",
    "severity": "MODERATE"
  },
  "details": "### Summary\n\nThe `extract_hidden_states` speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a `RuntimeError` that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (`repetition_penalty`, `frequency_penalty`, or `presence_penalty`).\n\nA single request with a penalty parameter (e.g., `\"repetition_penalty\": 1.1`) is sufficient to crash the server. The crash is deterministic and immediate \u2014 no concurrency, race condition, or special workload is required.\n\n### Details\n\nIn vLLM v0.17.0, the `extract_hidden_states` proposer\u0027s `propose()` method returned `sampled_token_ids.unsqueeze(-1)`, producing a tensor of shape `(batch_size, 1)`.\n\nIn [PR #37013](https://github.com/vllm-project/vllm/pull/37013) (first released in v0.18.0), the KV connector interface was refactored out of `propose()`. The return type changed from `tuple[Tensor, KVConnectorOutput | None]` to `Tensor`, and the `.unsqueeze(-1)` call was removed along with the KV connector output:\n\n```python\n# Before (v0.17.0):\nreturn sampled_token_ids.unsqueeze(-1), kv_connector_output  # shape (batch_size, 1)\n\n# After (v0.18.0+):\nreturn sampled_token_ids  # shape (batch_size, 2) after first decode step\n```\n\nThe refactor missed that `sampled_token_ids` changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as `(batch_size, max_spec_len + 1)`. With `num_speculative_tokens=1`, this produces shape `(batch_size, 2)` instead of the expected `(batch_size, 1)`, causing a broadcast shape mismatch during penalty application.\n\n### Impact\n\nAny vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with `extract_hidden_states` speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.\n\n### Patches\n\nFixed in [PR #38610](https://github.com/vllm-project/vllm/pull/38610), first included in vLLM v0.20.0. The fix slices the return value to `sampled_token_ids[:, :1]`, ensuring the correct `(batch_size, 1)` shape regardless of the rejection sampler\u0027s output dimensions.\n\n### Workarounds\n\n- Upgrade to vLLM v0.20.0 or later.\n- If upgrading is not possible, avoid using `extract_hidden_states` as the speculative decoding method on affected versions.\n- Alternatively, reject or strip penalty parameters (`repetition_penalty`, `frequency_penalty`, `presence_penalty`) from incoming requests at an API gateway before they reach vLLM.",
  "id": "GHSA-83vm-p52w-f9pw",
  "modified": "2026-06-08T19:52:35Z",
  "published": "2026-05-06T21:45:51Z",
  "references": [
    {
      "type": "WEB",
      "url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw"
    },
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2026-44223"
    },
    {
      "type": "WEB",
      "url": "https://github.com/vllm-project/vllm/pull/38610"
    },
    {
      "type": "WEB",
      "url": "https://github.com/pypa/advisory-database/tree/main/vulns/vllm/PYSEC-2026-145.yaml"
    },
    {
      "type": "PACKAGE",
      "url": "https://github.com/vllm-project/vllm"
    }
  ],
  "schema_version": "1.4.0",
  "severity": [
    {
      "score": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
      "type": "CVSS_V3"
    }
  ],
  "summary": "vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters"
}

Sightings

Author	Source	Type	Date	Other

Nomenclature

Seen: The vulnerability was mentioned, discussed, or observed by the user.
Confirmed: The vulnerability has been validated from an analyst's perspective.
Published Proof of Concept: A public proof of concept is available for this vulnerability.
Exploited: The vulnerability was observed as exploited by the user who reported the sighting.
Patched: The vulnerability was observed as successfully patched by the user who reported the sighting.
Not exploited: The vulnerability was not observed as exploited by the user who reported the sighting.
Not confirmed: The user expressed doubt about the validity of the vulnerability.
Not patched: The vulnerability was not observed as successfully patched by the user who reported the sighting.

Detection rules are retrieved from Rulezet.

Action not permitted

PYSEC-2026-145

CVE-2026-44223 (GCVE-0-2026-44223)

GHSA-83VM-P52W-F9PW

Summary

Details

Impact

Patches

Workarounds

Tags

Sightings

Nomenclature