From d1ffadc67e2eee2d5f8626dca6646e70e3aa9d76 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: [PATCH 045/242] x86/mm: Flush more aggressively in lazy TLB mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CVE-2017-5754

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent. This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses. I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB. Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries. I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode. With this patch
applied, we do it in one of two ways (see the condensed sketch below):

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.
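
In code terms, the two cases boil down to the following condensed sketch
of the enter_lazy_tlb() and flush_tlb_func_common() changes made by the
hunks below (the real code carries extra warnings and longer comments):

	void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
	{
		if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
			return;

		if (static_branch_unlikely(&tlb_use_lazy_mode))
			/* Lazy mode (the default without PCID): just note it. */
			this_cpu_write(cpu_tlbstate.is_lazy, true);
		else
			/* Lazy mode off (the default with PCID): switch CR3 now. */
			switch_mm(NULL, &init_mm, NULL);
	}

	/* In flush_tlb_func_common(), a lazy CPU then pays the cost the
	 * first time it would otherwise have to flush: */
	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
		switch_mm_irqs_off(NULL, &init_mm, NULL);
		return;
	}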

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.
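
For example (assuming debugfs is mounted at the usual /sys/kernel/debug),
lazy mode can be forced off, re-enabled, or inspected for a benchmark run:

	# echo 0 > /sys/kernel/debug/x86/tlb_use_lazy_mode
	# echo 1 > /sys/kernel/debug/x86/tlb_use_lazy_mode
	# cat /sys/kernel/debug/x86/tlb_use_lazy_mode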

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed. Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(backported from commit b956575bed91ecfb136a8300742ecbbf451471ab)
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
(cherry picked from commit a4bb9409c548ece51ec246fc5113a32b8d130142)
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 arch/x86/include/asm/mmu_context.h |   8 +-
 arch/x86/include/asm/tlbflush.h    |  24 ++++++
 arch/x86/mm/tlb.c                  | 160 +++++++++++++++++++++++++------
 3 files changed, 136 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..3c856a15b98e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }
 
-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-	int cpu = smp_processor_id();
-
-	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-		cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..6533da3036c9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread). If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm. Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm. loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 440400316c8a..b27aceaf7ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,8 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -140,52 +142,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	if (real_prev == next) {
 		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);
 
-		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * There's nothing to do: we weren't lazy, and we
-			 * aren't changing our mm. We don't need to flush
-			 * anything, nor do we need to update CR3, CR4, or
-			 * LDTR.
-			 */
-			return;
-		}
-
-		/* Resume remote flushes and then read tlb_gen. */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-		    next_tlb_gen) {
-			/*
-			 * Ideally, we'd have a flush_tlb() variant that
-			 * takes the known CR3 value as input. This would
-			 * be faster on Xen PV and on hypothetical CPUs
-			 * on which INVPCID is fast.
-			 */
-			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-				       next_tlb_gen);
-			write_cr3(build_cr3(next, prev_asid));
-
-			/*
-			 * This gets called via leave_mm() in the idle path
-			 * where RCU functions differently. Tracing normally
-			 * uses RCU, so we have to call the tracepoint
-			 * specially here.
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-						TLB_FLUSH_ALL);
-		}
-
 		/*
-		 * We just exited lazy mode, which means that CR4 and/or LDTR
-		 * may be stale. (Changes to the required CR4 and LDTR states
-		 * are not reflected in tlb_gen.)
+		 * We don't currently support having a real mm loaded without
+		 * our cpu set in mm_cpumask(). We have all the bookkeeping
+		 * in place to figure out whether we would need to flush
+		 * if our cpu were cleared in mm_cpumask(), but we don't
+		 * currently use it.
 		 */
+		if (WARN_ON_ONCE(real_prev != &init_mm &&
+				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		return;
 	} else {
 		u16 new_asid;
 		bool need_flush;
@@ -204,10 +178,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		}
 
 		/* Stop remote flushes for the previous mm */
-		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+				real_prev != &init_mm);
+		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -237,6 +210,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	switch_ldt(real_prev, next);
 }
 
+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm. Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row. It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+		return;
+
+	if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+		/*
+		 * There's a significant optimization that may be possible
+		 * here. We have accurate enough TLB flush tracking that we
+		 * don't need to maintain coherence of TLB per se when we're
+		 * lazy. We do, however, need to maintain coherence of
+		 * paging-structure caches. We could, in principle, leave our
+		 * old mm loaded and only switch to init_mm when
+		 * tlb_remove_page() happens.
+		 */
+		this_cpu_write(cpu_tlbstate.is_lazy, true);
+	} else {
+		switch_mm(NULL, &init_mm, NULL);
+	}
+}
+
 /*
  * Call this when reinitializing a CPU. It fixes the following potential
  * problems:
@@ -308,16 +312,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
 
+	if (unlikely(loaded_mm == &init_mm))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
 		/*
-		 * We're in lazy mode -- don't flush. We can get here on
-		 * remote flushes due to races and on local flushes if a
-		 * kernel thread coincidentally flushes the mm it's lazily
-		 * still using.
+		 * We're in lazy mode. We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB. Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
 		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}
 
@@ -616,3 +624,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+				 size_t count, loff_t *ppos)
+{
+	char buf[2];
+
+	buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+	buf[1] = '\n';
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	bool val;
+
+	if (kstrtobool_from_user(user_buf, count, &val))
+		return -EINVAL;
+
+	if (val)
+		static_branch_enable(&tlb_use_lazy_mode);
+	else
+		static_branch_disable(&tlb_use_lazy_mode);
+
+	return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+	.read = tlblazy_read_file,
+	.write = tlblazy_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+	if (boot_cpu_has(X86_FEATURE_PCID)) {
+		/*
+		 * Heuristic: with PCID on, switching to and from
+		 * init_mm is reasonably fast, but remote flush IPIs
+		 * as expensive as ever, so turn off lazy TLB mode.
+		 *
+		 * We can't do this in setup_pcid() because static keys
+		 * haven't been initialized yet, and it would blow up
+		 * badly.
+		 */
+		static_branch_disable(&tlb_use_lazy_mode);
+	}
+
+	debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlblazy);
+	return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
-- 
2.14.2