ghsa-43g8-4r39-vq2f
Vulnerability from github
Published
2024-07-12 15:31
Modified
2024-07-12 15:31
Details

In the Linux kernel, the following vulnerability has been resolved:

parisc: Try to fix random segmentation faults in package builds

PA-RISC systems with PA8800 and PA8900 processors have had problems with random segmentation faults for many years. Systems with earlier processors are much more stable.

Systems with PA8800 and PA8900 processors have a large L2 cache which needs per page flushing for decent performance when a large range is flushed. The combined cache in these systems is also more sensitive to non-equivalent aliases than the caches in earlier systems.

The majority of random segmentation faults that I have looked at appear to be memory corruption in memory allocated using mmap and malloc.

My first attempt at fixing the random faults didn't work. On reviewing the cache code, I realized that there were two issues which the existing code didn't handle correctly. Both relate to cache move-in. Another issue is that the present bit in PTEs is racy.

1) PA-RISC caches have a mind of their own and they can speculatively load data and instructions for a page as long as there is a entry in the TLB for the page which allows move-in. TLBs are local to each CPU. Thus, the TLB entry for a page must be purged before flushing the page. This is particularly important on SMP systems.

In some of the flush routines, the flush routine would be called and then the TLB entry would be purged. This was because the flush routine needed the TLB entry to do the flush.

2) My initial approach to trying the fix the random faults was to try and use flush_cache_page_if_present for all flush operations. This actually made things worse and led to a couple of hardware lockups. It finally dawned on me that some lines weren't being flushed because the pte check code was racy. This resulted in random inequivalent mappings to physical pages.

The __flush_cache_page tmpalias flush sets up its own TLB entry and it doesn't need the existing TLB entry. As long as we can find the pte pointer for the vm page, we can get the pfn and physical address of the page. We can also purge the TLB entry for the page before doing the flush. Further, __flush_cache_page uses a special TLB entry that inhibits cache move-in.

When switching page mappings, we need to ensure that lines are removed from the cache. It is not sufficient to just flush the lines to memory as they may come back.

This made it clear that we needed to implement all the required flush operations using tmpalias routines. This includes flushes for user and kernel pages.

After modifying the code to use tmpalias flushes, it became clear that the random segmentation faults were not fully resolved. The frequency of faults was worse on systems with a 64 MB L2 (PA8900) and systems with more CPUs (rp4440).

The warning that I added to flush_cache_page_if_present to detect pages that couldn't be flushed triggered frequently on some systems.

Helge and I looked at the pages that couldn't be flushed and found that the PTE was either cleared or for a swap page. Ignoring pages that were swapped out seemed okay but pages with cleared PTEs seemed problematic.

I looked at routines related to pte_clear and noticed ptep_clear_flush. The default implementation just flushes the TLB entry. However, it was obvious that on parisc we need to flush the cache page as well. If we don't flush the cache page, stale lines will be left in the cache and cause random corruption. Once a PTE is cleared, there is no way to find the physical address associated with the PTE and flush the associated page at a later time.

I implemented an updated change with a parisc specific version of ptep_clear_flush. It fixed the random data corruption on Helge's rp4440 and rp3440, as well as on my c8000.

At this point, I realized that I could restore the code where we only flush in flush_cache_page_if_present if the page has been accessed. However, for this, we also need to flush the cache when the accessed bit is cleared in ---truncated---

Show details on source website


{
  "affected": [],
  "aliases": [
    "CVE-2024-40918"
  ],
  "database_specific": {
    "cwe_ids": [],
    "github_reviewed": false,
    "github_reviewed_at": null,
    "nvd_published_at": "2024-07-12T13:15:14Z",
    "severity": null
  },
  "details": "In the Linux kernel, the following vulnerability has been resolved:\n\nparisc: Try to fix random segmentation faults in package builds\n\nPA-RISC systems with PA8800 and PA8900 processors have had problems\nwith random segmentation faults for many years.  Systems with earlier\nprocessors are much more stable.\n\nSystems with PA8800 and PA8900 processors have a large L2 cache which\nneeds per page flushing for decent performance when a large range is\nflushed. The combined cache in these systems is also more sensitive to\nnon-equivalent aliases than the caches in earlier systems.\n\nThe majority of random segmentation faults that I have looked at\nappear to be memory corruption in memory allocated using mmap and\nmalloc.\n\nMy first attempt at fixing the random faults didn\u0027t work. On\nreviewing the cache code, I realized that there were two issues\nwhich the existing code didn\u0027t handle correctly. Both relate\nto cache move-in. Another issue is that the present bit in PTEs\nis racy.\n\n1) PA-RISC caches have a mind of their own and they can speculatively\nload data and instructions for a page as long as there is a entry in\nthe TLB for the page which allows move-in. TLBs are local to each\nCPU. Thus, the TLB entry for a page must be purged before flushing\nthe page. This is particularly important on SMP systems.\n\nIn some of the flush routines, the flush routine would be called\nand then the TLB entry would be purged. This was because the flush\nroutine needed the TLB entry to do the flush.\n\n2) My initial approach to trying the fix the random faults was to\ntry and use flush_cache_page_if_present for all flush operations.\nThis actually made things worse and led to a couple of hardware\nlockups. It finally dawned on me that some lines weren\u0027t being\nflushed because the pte check code was racy. This resulted in\nrandom inequivalent mappings to physical pages.\n\nThe __flush_cache_page tmpalias flush sets up its own TLB entry\nand it doesn\u0027t need the existing TLB entry. As long as we can find\nthe pte pointer for the vm page, we can get the pfn and physical\naddress of the page. We can also purge the TLB entry for the page\nbefore doing the flush. Further, __flush_cache_page uses a special\nTLB entry that inhibits cache move-in.\n\nWhen switching page mappings, we need to ensure that lines are\nremoved from the cache.  It is not sufficient to just flush the\nlines to memory as they may come back.\n\nThis made it clear that we needed to implement all the required\nflush operations using tmpalias routines. This includes flushes\nfor user and kernel pages.\n\nAfter modifying the code to use tmpalias flushes, it became clear\nthat the random segmentation faults were not fully resolved. The\nfrequency of faults was worse on systems with a 64 MB L2 (PA8900)\nand systems with more CPUs (rp4440).\n\nThe warning that I added to flush_cache_page_if_present to detect\npages that couldn\u0027t be flushed triggered frequently on some systems.\n\nHelge and I looked at the pages that couldn\u0027t be flushed and found\nthat the PTE was either cleared or for a swap page. Ignoring pages\nthat were swapped out seemed okay but pages with cleared PTEs seemed\nproblematic.\n\nI looked at routines related to pte_clear and noticed ptep_clear_flush.\nThe default implementation just flushes the TLB entry. However, it was\nobvious that on parisc we need to flush the cache page as well. If\nwe don\u0027t flush the cache page, stale lines will be left in the cache\nand cause random corruption. Once a PTE is cleared, there is no way\nto find the physical address associated with the PTE and flush the\nassociated page at a later time.\n\nI implemented an updated change with a parisc specific version of\nptep_clear_flush. It fixed the random data corruption on Helge\u0027s rp4440\nand rp3440, as well as on my c8000.\n\nAt this point, I realized that I could restore the code where we only\nflush in flush_cache_page_if_present if the page has been accessed.\nHowever, for this, we also need to flush the cache when the accessed\nbit is cleared in\n---truncated---",
  "id": "GHSA-43g8-4r39-vq2f",
  "modified": "2024-07-12T15:31:27Z",
  "published": "2024-07-12T15:31:27Z",
  "references": [
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2024-40918"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/5bf196f1936bf93df31112fbdfb78c03537c07b0"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/72d95924ee35c8cd16ef52f912483ee938a34d49"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/d66f2607d89f760cdffed88b22f309c895a2af20"
    }
  ],
  "schema_version": "1.4.0",
  "severity": []
}


Log in or create an account to share your comment.




Tags
Taxonomy of the tags.


Loading…

Loading…

Loading…

Sightings

Author Source Type Date

Nomenclature

  • Seen: The vulnerability was mentioned, discussed, or seen somewhere by the user.
  • Confirmed: The vulnerability is confirmed from an analyst perspective.
  • Exploited: This vulnerability was exploited and seen by the user reporting the sighting.
  • Patched: This vulnerability was successfully patched by the user reporting the sighting.
  • Not exploited: This vulnerability was not exploited or seen by the user reporting the sighting.
  • Not confirmed: The user expresses doubt about the veracity of the vulnerability.
  • Not patched: This vulnerability was not successfully patched by the user reporting the sighting.