From d1ffadc67e2eee2d5f8626dca6646e70e3aa9d76 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: [PATCH 045/231] x86/mm: Flush more aggressively in lazy TLB mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent. This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses. I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB. Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries. I've seen a couple reports of this.

Boris further explains the failure mode:

  > It is actually more of an optimization which assumes that paging-structure
  > entries are in WB DRAM:
  >
  > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
  > performance optimization that assumes PML4, PDP, PDE, and PTE entries
  > are in cacheable WB-DRAM; memory type checks may be bypassed, and
  > addresses outside of WB-DRAM may result in undefined behavior or NB
  > protocol errors. 1=Disables performance optimization and allows PML4,
  > PDP, PDE and PTE entries to be in any memory type. Operating systems
  > that maintain page tables in memory types other than WB-DRAM must set
  > TlbCacheDis to insure proper operation."
  >
  > The MCE generated is an NB protocol error to signal that
  >
  > "Link: A specific coherent-only packet from a CPU was issued to an
  > IO link. This may be caused by software which addresses page table
  > structures in a memory type other than cacheable WB-DRAM without
  > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
  > example, when page table structure addresses are above top of memory. In
  > such cases, the NB will generate an MCE if it sees a mismatch between
  > the memory operation generated by the core and the link type."
  >
  > I'm assuming coherent-only packets don't go out on IO links, thus the
  > error.

To fix this, reinstate TLB coherence in lazy mode. With this patch
applied, we do it in one of two ways:

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.
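
As a rough illustration only (this is not code from the patch), the two
strategies above can be modelled in plain userspace C. Every name below
(struct cpu_model, model_switch_mm(), model_enter_lazy_tlb(),
model_flush_request(), init_mm_model) is a hypothetical stand-in rather
than a kernel API; the use_lazy_mode field stands in for the
tlb_use_lazy_mode static key, which the patch turns off when PCID is
available. Only the decision logic is meant to mirror the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mm_struct. */
struct mm { int id; };

/* Hypothetical stand-in for the kernel's init_mm. */
static struct mm init_mm_model = { 0 };

/* Hypothetical, simplified model of the per-CPU TLB state. */
struct cpu_model {
	struct mm *loaded_mm;	/* page tables currently loaded in CR3 */
	bool is_lazy;		/* lazily borrowing loaded_mm */
	bool use_lazy_mode;	/* models the tlb_use_lazy_mode key */
	unsigned int cr3_writes;	/* number of simulated CR3 loads */
};

/* Switching to an mm always leaves lazy mode; CR3 is loaded on a change. */
static void model_switch_mm(struct cpu_model *cpu, struct mm *next)
{
	assert(cpu && next);
	cpu->is_lazy = false;
	if (cpu->loaded_mm != next) {
		cpu->loaded_mm = next;
		cpu->cr3_writes++;
	}
}

/* Entering a kernel thread: go lazy, or switch to init_mm right away. */
static void model_enter_lazy_tlb(struct cpu_model *cpu)
{
	if (cpu->loaded_mm == &init_mm_model)
		return;

	if (cpu->use_lazy_mode)
		cpu->is_lazy = true;			/* no-PCID strategy */
	else
		model_switch_mm(cpu, &init_mm_model);	/* PCID strategy */
}

/*
 * A flush request for the loaded mm arrives. Returns true if the model
 * would perform a real flush, false if nothing needs flushing or if the
 * lazy CPU switched to init_mm instead.
 */
static bool model_flush_request(struct cpu_model *cpu)
{
	if (cpu->loaded_mm == &init_mm_model)
		return false;

	if (cpu->is_lazy) {
		model_switch_mm(cpu, &init_mm_model);
		return false;
	}

	return true;
}
```

In this model a CPU with use_lazy_mode == false pays one CR3 load as soon
as it enters a kernel thread, while a CPU with use_lazy_mode == true
defers that load until the first flush request.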

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.
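
For benchmarking scripts, a small userspace helper along the following
lines could read and flip the switch. This is only a sketch: the helper
names (parse_lazy_mode(), read_lazy_mode(), write_lazy_mode()) are made
up for illustration, and it assumes debugfs is mounted at
/sys/kernel/debug and that the caller has permission to access the file.
The "1\n" / "0\n" format matches what the read handler added by this
patch emits.

```c
#include <stdbool.h>
#include <stdio.h>

/* Path from the commit message; assumes debugfs is at /sys/kernel/debug. */
#define TLB_LAZY_PATH "/sys/kernel/debug/x86/tlb_use_lazy_mode"

/* Interpret the file contents: '1' -> 1, '0' -> 0, anything else -> -1. */
static int parse_lazy_mode(const char *buf)
{
	if (buf == NULL)
		return -1;
	if (buf[0] == '1')
		return 1;
	if (buf[0] == '0')
		return 0;
	return -1;
}

/* Read the current setting; returns 1, 0, or -1 on error. */
static int read_lazy_mode(const char *path)
{
	char buf[4] = "";
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) == NULL)
		buf[0] = '\0';
	fclose(f);

	return parse_lazy_mode(buf);
}

/* Write "1" or "0" to the switch; returns 0 on success, -1 on failure. */
static int write_lazy_mode(const char *path, bool enable)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (f == NULL)
		return -1;
	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A benchmark harness might call write_lazy_mode(TLB_LAZY_PATH, false)
before a run and confirm the result with read_lazy_mode(TLB_LAZY_PATH).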

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed. Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
(cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 arch/x86/include/asm/mmu_context.h |   8 +-
 arch/x86/include/asm/tlbflush.h    |  24 ++++++
 arch/x86/mm/tlb.c                  | 160 +++++++++++++++++++++++++------------
 3 files changed, 136 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..3c856a15b98e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }
 
-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-	int cpu = smp_processor_id();
-
-	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-		cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..6533da3036c9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread). If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm. Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm. loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 440400316c8a..b27aceaf7ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,8 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	if (real_prev == next) {
 		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);
 
-		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * There's nothing to do: we weren't lazy, and we
-			 * aren't changing our mm. We don't need to flush
-			 * anything, nor do we need to update CR3, CR4, or
-			 * LDTR.
-			 */
-			return;
-		}
-
-		/* Resume remote flushes and then read tlb_gen. */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-		    next_tlb_gen) {
-			/*
-			 * Ideally, we'd have a flush_tlb() variant that
-			 * takes the known CR3 value as input. This would
-			 * be faster on Xen PV and on hypothetical CPUs
-			 * on which INVPCID is fast.
-			 */
-			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-				       next_tlb_gen);
-			write_cr3(build_cr3(next, prev_asid));
-
-			/*
-			 * This gets called via leave_mm() in the idle path
-			 * where RCU functions differently. Tracing normally
-			 * uses RCU, so we have to call the tracepoint
-			 * specially here.
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-						TLB_FLUSH_ALL);
-		}
-
 		/*
-		 * We just exited lazy mode, which means that CR4 and/or LDTR
-		 * may be stale. (Changes to the required CR4 and LDTR states
-		 * are not reflected in tlb_gen.)
+		 * We don't currently support having a real mm loaded without
+		 * our cpu set in mm_cpumask(). We have all the bookkeeping
+		 * in place to figure out whether we would need to flush
+		 * if our cpu were cleared in mm_cpumask(), but we don't
+		 * currently use it.
 		 */
+		if (WARN_ON_ONCE(real_prev != &init_mm &&
+				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		return;
 	} else {
 		u16 new_asid;
 		bool need_flush;
@@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		}
 
 		/* Stop remote flushes for the previous mm */
-		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+				real_prev != &init_mm);
+		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	switch_ldt(real_prev, next);
 }
 
+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm. Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row. It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+		return;
+
+	if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+		/*
+		 * There's a significant optimization that may be possible
+		 * here. We have accurate enough TLB flush tracking that we
+		 * don't need to maintain coherence of TLB per se when we're
+		 * lazy. We do, however, need to maintain coherence of
+		 * paging-structure caches. We could, in principle, leave our
+		 * old mm loaded and only switch to init_mm when
+		 * tlb_remove_page() happens.
+		 */
+		this_cpu_write(cpu_tlbstate.is_lazy, true);
+	} else {
+		switch_mm(NULL, &init_mm, NULL);
+	}
+}
+
 /*
  * Call this when reinitializing a CPU. It fixes the following potential
  * problems:
@@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
 
+	if (unlikely(loaded_mm == &init_mm))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
 		/*
-		 * We're in lazy mode -- don't flush. We can get here on
-		 * remote flushes due to races and on local flushes if a
-		 * kernel thread coincidentally flushes the mm it's lazily
-		 * still using.
+		 * We're in lazy mode. We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB. Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
 		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}
 
@@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+				 size_t count, loff_t *ppos)
+{
+	char buf[2];
+
+	buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+	buf[1] = '\n';
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	bool val;
+
+	if (kstrtobool_from_user(user_buf, count, &val))
+		return -EINVAL;
+
+	if (val)
+		static_branch_enable(&tlb_use_lazy_mode);
+	else
+		static_branch_disable(&tlb_use_lazy_mode);
+
+	return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+	.read = tlblazy_read_file,
+	.write = tlblazy_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+	if (boot_cpu_has(X86_FEATURE_PCID)) {
+		/*
+		 * Heuristic: with PCID on, switching to and from
+		 * init_mm is reasonably fast, but remote flush IPIs
+		 * as expensive as ever, so turn off lazy TLB mode.
+		 *
+		 * We can't do this in setup_pcid() because static keys
+		 * haven't been initialized yet, and it would blow up
+		 * badly.
+		 */
+		static_branch_disable(&tlb_use_lazy_mode);
+	}
+
+	debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlblazy);
+	return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
-- 
2.14.2