{"id":3073,"date":"2018-11-25T10:44:59","date_gmt":"2018-11-25T10:44:59","guid":{"rendered":"http:\/\/mertboru.com\/?p=3073"},"modified":"2019-05-13T07:51:29","modified_gmt":"2019-05-13T07:51:29","slug":"taming-a-beast-cache","status":"publish","type":"post","link":"http:\/\/mertboru.com\/?p=3073","title":{"rendered":"Taming a Beast: CPU Cache"},"content":{"rendered":"<p style=\"text-align: right;\"><span style=\"color: #999999;\"><em>(Cover Photo:\u00a0 \u00a9 Granger &#8211; &#8220;Lion Tamer&#8221;<\/em><\/span><br \/>\n<span style=\"color: #999999;\"><em>The American animal tamer Clyde Beatty<\/em><\/span><br \/>\n<span style=\"color: #999999;\"><em>performing in the 1930s.)<\/em><\/span><\/p>\n<p><strong>The processor&#8217;s caches are for the most part transparent to software. When enabled, instructions and data flow through these caches without the need for explicit software control. However, knowledge of the behavior of these caches may be useful in optimizing software performance. If not tamed wisely, these innocent cache mechanisms can certainly be a headache for novice C\/C++ programmers.<\/strong><\/p>\n<p>First things first\u2026 Before I start with example C\/C++ codes showing some common pitfalls and urban caching myths that lead to hard-to-trace bugs, I would like to make sure that we are all comfortable with <em>&#8216;cache related terms&#8217;<\/em>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1658\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p><strong>Terminology<\/strong><\/p>\n<p>In theory, CPU <strong>cache<\/strong> is a very high speed type of <em>memory<\/em> that is placed between the CPU and the main memory. (In practice, it is actually <em>inside<\/em> the processor, mostly operating at the speed of the CPU.) 
To reduce the latency of fetching information from the main memory, the cache temporarily stores some of the information so that the next access to the same chunk of information is faster. CPU cache can store both <span style=\"color: #800000;\"><em><strong>&#8216;executable instructions&#8217;<\/strong><\/em><\/span> and <span style=\"color: #800000;\"><em><strong>&#8216;raw data&#8217;<\/strong><\/em><\/span>.<\/p>\n<blockquote class=\"alignright\">\n<p style=\"text-align: center;\"><strong>&#8220;&#8230; from cache, instead of going back to memory.&#8221;<\/strong><\/p>\n<\/blockquote>\n<p>When the processor recognizes that information being read from memory is cacheable, the processor reads an entire cache line into the appropriate cache slot (L1, L2, L3, or all). This operation is called a <strong>cache line fill<\/strong>. If the memory location containing that information is still cached when the processor attempts to access it again, the processor can read that information from the cache instead of going back to memory. 
This operation is called a <strong>cache hit<\/strong>.<\/p>\n<figure id=\"attachment_3140\" aria-describedby=\"caption-attachment-3140\" style=\"width: 2030px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_IntelCorei7Processors.jpg\" data-rel=\"lightbox-image-0\" data-rl_title=\"Cache Structure of the Intel Core i7 Processors\" data-rl_caption=\"\" title=\"Cache Structure of the Intel Core i7 Processors\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-3140 size-full\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_IntelCorei7Processors.jpg\" alt=\"\" width=\"2030\" height=\"1178\" srcset=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_IntelCorei7Processors.jpg 2030w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_IntelCorei7Processors-300x174.jpg 300w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_IntelCorei7Processors-768x446.jpg 768w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_IntelCorei7Processors-1024x594.jpg 1024w\" sizes=\"auto, (max-width: 2030px) 100vw, 2030px\" \/><\/a><figcaption id=\"caption-attachment-3140\" class=\"wp-caption-text\">Hierarchical Cache Structure of the Intel Core i7 Processors<\/figcaption><\/figure>\n<p>When the processor attempts to write information to a cacheable area of memory, it first checks if a cache line for that memory location exists in the cache. If a valid cache line does exist, the processor (depending on the write policy currently in force) can write that information into the cache instead of writing it out to system memory. This operation is called a <strong>write hit<\/strong>. If a write misses the cache (that is, a valid cache line is not present for the area of memory being written to), the processor performs a cache line fill; this operation is called <strong>write allocation<\/strong>. 
Then it writes the information into the cache line and (depending on the write policy currently in force) can also write it out to memory. If the information is to be written out to memory, it is written first into the store buffer, and then written from the store buffer to memory when the system bus is available.<\/p>\n<blockquote class=\"alignleft\">\n<p style=\"text-align: center;\"><strong>&#8220;&#8230; cached in shared state, between multiple CPUs.&#8221;<\/strong><\/p>\n<\/blockquote>\n<p>When operating in a <strong>multi-processor system<\/strong>, the Intel 64 and IA-32 architectures have the ability to keep their internal caches consistent both with system memory and with the caches in other processors on the bus. For example, if one processor detects that another processor intends to write to a memory location that it currently has cached in shared state, the processor in charge will invalidate its cache line, forcing it to perform a cache line fill the next time it accesses the same memory location. This type of internal communication between the CPUs is called <strong>snooping<\/strong>.<\/p>\n<p>And finally, the <strong>translation lookaside buffer (TLB)<\/strong> is a special type of cache designed for speeding up address translation for virtual memory related operations. It is a part of the chip\u2019s memory-management unit (MMU). The TLB keeps track of where virtual pages are stored in physical memory, thus speeding up &#8216;virtual address to physical address&#8217; translation by caching recently used page-table entries.<\/p>\n<p>So far so good&#8230; Let&#8217;s start coding and shed some light on urban caching myths. 
\ud83d\ude09<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>How to Guarantee Caching in C\/C++<\/strong><\/p>\n<p>To be honest, under normal conditions, there is absolutely no way to guarantee that the variable you defined in C\/C++ will be cached. CPU cache and write buffer management are simply out of the scope of the C\/C++ language.<\/p>\n<p>Most programmers assume that declaring a variable as constant will automatically turn it into <em>something<\/em> cacheable!<\/p>\n<pre style=\"padding-left: 30px;\">const int nVar = 33;<\/pre>\n<p>As a matter of fact, doing so will tell the C\/C++ compiler that it is forbidden for the rest of the code to modify the variable&#8217;s value, which <strong><em><span style=\"color: #800000;\">may or may not<\/span><\/em><\/strong> lead to a cacheable case. By using a <strong>const<\/strong>, you simply increase the chance of getting it cached. In most cases, the compiler will be able to turn it into a cache hit. However, we can never be sure about it unless we debug and trace the variable with our own eyes.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>How to Guarantee No Caching in C\/C++<\/strong><\/p>\n<p>An urban myth states that, by using the <strong>volatile<\/strong> type qualifier, it is possible to guarantee that a variable can never be cached. 
In other words, this myth assumes that it might be possible to disable CPU caching features for specific C\/C++ variables in your code!<\/p>\n<pre style=\"padding-left: 30px;\">volatile int nVar = 33;<\/pre>\n<p>Actually, defining a variable as <strong>volatile<\/strong> prevents the compiler from optimizing it, and forces the compiler to always refetch (read once again) the value of that variable from memory. But, this <strong><span style=\"color: #800000;\"><em>may or may not <\/em><\/span><\/strong>prevent it from caching, as <strong>volatile<\/strong> has nothing to do with CPU caches and write buffers, and there is no standard support for these features in C\/C++.<\/p>\n<p>So, what happens if we declare the same variable without <strong>const<\/strong> or <strong>volatile<\/strong>?<\/p>\n<pre style=\"padding-left: 30px;\">int nVar = 33;<\/pre>\n<p>Well, in most cases, your code will be executed and cached properly. (Still not guaranteed though.) But one thing is for sure&#8230; If you write <span style=\"color: #800000;\"><strong><em>&#8216;weird&#8217;<\/em><\/strong><\/span> code, like the following one, then you are asking for trouble!<\/p>\n<pre style=\"padding-left: 30px;\">int nVar = 33;\r\nwhile (nVar == 33)\r\n{\r\n\u00a0\u00a0 . . .\r\n}<\/pre>\n<p>In this case, if optimization is enabled, the C\/C++ compiler may assume that <strong>nVar<\/strong> never changes (it is always 33), since <strong>nVar<\/strong> is never referenced in the loop&#8217;s body, and may replace the <strong>while<\/strong> condition with <strong>true<\/strong>.<\/p>\n<pre style=\"padding-left: 30px;\">while (true)\r\n{\r\n\u00a0\u00a0 . . 
.\r\n}<\/pre>\n<p>A simple <strong>volatile<\/strong> type qualifier fixes the problem, actually.<\/p>\n<pre style=\"padding-left: 30px;\">volatile int nVar = 33;\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>What about Pointers?<\/strong><\/p>\n<p>Well, handling pointers is no different from taking care of simple integers.<\/p>\n<p><strong>Case #1:<\/strong><\/p>\n<p>Let&#8217;s try to evaluate the <strong>while<\/strong> case mentioned above once again, but this time with a Pointer.<\/p>\n<pre style=\"padding-left: 30px;\">int nVar = 33;\r\nint *pVar = (int*) &amp;nVar;\r\nwhile (*pVar)\r\n{\r\n   . . .\r\n}<\/pre>\n<p>In this case,<\/p>\n<p style=\"padding-left: 30px;\"><strong><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 nVar<\/strong> is declared as an integer with an initial value of 33,<br \/>\n<strong><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 pVar<\/strong> is assigned as a Pointer to <strong>nVar<\/strong>,<br \/>\n<strong><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <\/strong>the value of <strong>nVar<\/strong> (33) is read through pointer <strong>pVar<\/strong>, and this value is used as the conditional statement of the <strong>while<\/strong> loop.<\/p>\n<p>On the surface there is nothing wrong with this code, but if 
aggressive C\/C++ compiler optimizations are enabled, then we <span style=\"color: #800000;\"><strong><em>might be<\/em><\/strong><\/span> in trouble. &#8211; <em>Yes, some compilers are smarter than others!<\/em> \ud83d\ude09<\/p>\n<p>Due to the fact that the pointed-to value is never modified and\/or written through the <strong>while<\/strong> loop, the compiler may decide to optimize the frequently evaluated conditional statement of the loop. Instead of fetching <strong>*pVar<\/strong> (the value of nVar) from memory each time, the compiler might decide that keeping this value in a <strong>register<\/strong> is a good idea. This is known as <span style=\"color: #800000;\"><em><strong>&#8216;software caching&#8217;<\/strong><\/em><\/span>.<\/p>\n<p>Now, we have two problems here:<\/p>\n<p style=\"padding-left: 30px;\"><strong>1.)<\/strong> Values kept in <strong>registers<\/strong> live outside the cache hierarchy altogether. If the software-cached value in the register goes out of sync with the original one in memory, the CPU will never be aware of this situation and the loop will keep on using the stale copy. &#8211; CPU cache vs software cache. What a mess!<\/p>\n<div style=\"border: 1px solid red; padding: 25px; margin: 25px;\"><strong>Tip:<\/strong> Is that scenario really possible?! &#8211; To be honest, no. During the compilation process, the C\/C++ compiler should be clever enough to foresee that problem, <strong><span style=\"color: #800000;\"><em>if-and-only-if<\/em><\/span><\/strong>\u00a0<strong>*pVar<\/strong> is never modified in the loop&#8217;s body. However, as programmers, it is our responsibility to make sure that the compiler is given <strong><em>&#8216;properly written code&#8217;<\/em><\/strong> with no ambiguous logic\/data treatment. 
So, instead of keeping our fingers crossed and expecting miracles from the compiler, we should take complete control over the direction of our code. Before making assumptions on how our code will be compiled, we should first make sure that our code is crystal clear.<\/div>\n<p style=\"padding-left: 30px;\"><strong>2.)<\/strong> Since the value of <strong>nVar<\/strong> has never\u00a0been modified, the compiler can even go one step further by assuming that the check against <strong>*pVar <\/strong>can be cast to a Boolean value, due to its usage as a conditional statement. As a result of this optimization, the code above might turn into this:<\/p>\n<pre style=\"padding-left: 30px;\">int nVar = 33;\r\nint *pVar = (int*) &amp;nVar;\r\n\r\nif (*pVar)\r\n{\r\n   while (true)\r\n   {\r\n      . . .\r\n   }\r\n}<\/pre>\n<p>Both problems detailed above can be fixed by using the <strong>volatile<\/strong> type qualifier. Doing so prevents the compiler from optimizing <strong>*pVar<\/strong>, and forces the compiler to always refetch the value from memory, rather than using a compiler-generated <span style=\"color: #800000;\"><em><strong>software cached<\/strong><\/em><\/span> version in registers.<\/p>\n<pre style=\"padding-left: 30px;\">int nVar = 33;\r\nvolatile int *pVar = (int*) &amp;nVar;\r\nwhile (*pVar)\r\n{\r\n   . . 
.\r\n}<\/pre>\n<p><strong>Case #2:<\/strong><\/p>\n<p>Here comes another tricky example about pointers.<\/p>\n<pre style=\"padding-left: 30px;\">const int nVar = 33;\r\nint *pVar = (int*) &amp;nVar;\r\n*pVar = 0;<\/pre>\n<p>In this case,<\/p>\n<p style=\"padding-left: 30px;\"><strong><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 nVar<\/strong> is declared as a <em>&#8216;constant&#8217;<\/em> variable,<br \/>\n<strong><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 pVar<\/strong> is assigned as a Pointer to <strong>nVar<\/strong>,<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 and, <strong>pVar<\/strong> is trying to change the <em>&#8216;constant&#8217;<\/em> value of <strong>nVar<\/strong>!<\/p>\n<p>Under normal conditions, no C\/C++ programmer would make such a mistake, but for the sake of clarity let&#8217;s assume that we did.<\/p>\n<p>If aggressive optimization is enabled, due to the fact that:<\/p>\n<p style=\"padding-left: 30px;\"><strong>a.)<\/strong> the pointer variable points to a constant variable,<\/p>\n<p style=\"padding-left: 30px;\"><strong>b.)<\/strong> the value of the pointer variable is never modified and\/or accessed afterwards,<\/p>\n<p>some compilers may assume that the pointer can be optimized <span style=\"color: #800000;\"><strong><em>for the sake of software caching<\/em><\/strong><\/span>. So, despite<strong> *pVar = 0<\/strong>, the value of <strong>nVar<\/strong> <em>may<\/em> never change. (In fact, modifying a <strong>const<\/strong> object through a cast is undefined behavior in C\/C++, so the compiler is free to generate anything here.)<\/p>\n<p>Is that all? 
Well, no\u2026 Here comes the worst part! The value of <strong>nVar<\/strong> is actually <span style=\"color: #800000;\"><strong><em>compiler dependent<\/em><\/strong><\/span>. If you compile the code above with a bunch of different C\/C++ compilers, you will notice that in some of them <strong>nVar<\/strong> will be set to 0, and in some others set to 33 as a result of <span style=\"color: #800000;\"><strong><em>&#8216;ambiguous&#8217;<\/em><\/strong><\/span> code compilation\/execution. Why? Simply because every compiler has its own standards when it comes to generating code for &#8216;constant&#8217; variables. As a result of this inconsistent situation, even with just a single constant variable, things can easily get very complicated.<\/p>\n<div style=\"border: 1px solid red; padding: 25px; margin: 25px;\"><strong>Tip:<\/strong> The best way to fix &#8216;cache-oriented compiler optimization issues&#8217; is to change the way you write code, keeping tricky compiler-specific optimizations in mind. Try to write crystal clear code. Never assume that the compiler knows programming better than you. Always debug, trace, and check the output&#8230; Be prepared for the unexpected!<\/div>\n<p>Fixing such brute-force compiler optimization issues is quite easy. You can get rid of the <strong>const<\/strong>\u00a0type qualifier,<\/p>\n<pre style=\"padding-left: 30px;\"><del>const<\/del> int nVar = 33;<\/pre>\n<p>or, replace <strong>const<\/strong> with the <strong>volatile<\/strong>\u00a0type qualifier,<\/p>\n<pre style=\"padding-left: 30px;\">volatile int nVar = 33;<\/pre>\n<p>or, use both!<\/p>\n<pre style=\"padding-left: 30px;\">const volatile int nVar = 33;<\/pre>\n<div style=\"border: 1px solid red; padding: 25px; margin: 25px;\"><strong>Tip:<\/strong> The &#8216;const volatile&#8217; combination is commonly used on embedded systems, where <em><strong>hardware registers<\/strong><\/em> can be read and are updated by the hardware, but cannot be altered by software. 
In such cases, the hardware register&#8217;s value is never software-cached; it is always refetched from memory.<\/div>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>Rule of Thumb<\/strong><\/p>\n<p>Using <strong>volatile<\/strong> is <strong><span style=\"color: #800000;\"><em>absolutely necessary<\/em><\/span><\/strong> in any situation where the compiler could make wrong assumptions about a variable keeping its value constant, just because the function itself does not change it. Not using <strong>volatile<\/strong> can create very complicated bugs, because the executed code behaves as if the value did not change &#8211; <em>(it did, indeed)<\/em>.<\/p>\n<p>If code that works fine somehow fails when you:<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 Use cross compilers,<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 Port code to a different compiler,<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 Enable compiler optimizations,<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 Enable interrupts,<\/p>\n<p>make sure that your compiler is 
NOT over-optimizing variables for the sake of software caching.<\/p>\n<p>Please keep in mind that <strong>volatile<\/strong> has nothing to do with CPU caches and write buffers, and there is no standard support for these features in C\/C++. These are out of the scope of the C\/C++ language, and must be solved by directly interacting with the CPU core!<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>Getting Hands Dirty via Low-Level CPU Cache Control<\/strong><\/p>\n<p>Software-driven hardware cache management is possible. There are special <span style=\"color: #800000;\"><strong><em>&#8216;privileged&#8217;<\/em><\/strong><\/span> Assembler instructions to clean, invalidate, flush cache(s), and synchronize the write buffer. They can be directly executed from privileged modes. (User mode applications can control the cache through system calls only.) Most compilers support this through built-in\/intrinsic functions or inline Assembler.<\/p>\n<p>The Intel 64 and IA-32 architectures provide a variety of mechanisms for controlling the caching of data and instructions, and for controlling the ordering of reads\/writes between the processor, the caches, and memory.<\/p>\n<p>These mechanisms can be divided into two groups:<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>Cache control registers and bits:<\/strong> The Intel 64 and IA-32 architectures define several dedicated registers and various bits within control registers and page\/directory-table entries that control the caching of system memory locations in the L1, L2, and L3 caches. 
These mechanisms control the caching of virtual memory pages and of regions of physical memory.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>Cache control and memory ordering instructions:<\/strong> The Intel 64 and IA-32 architectures provide several instructions that control the caching of data, the ordering of memory reads and writes, and the prefetching of data. These instructions allow software to control the caching of specific data structures, to control memory coherency for specific locations in memory, and to force strong memory ordering at specific locations in a program.<\/p>\n<p><strong>How does it work?<\/strong><\/p>\n<p>The <strong><em><span style=\"color: #800000;\">Cache Control flags<\/span><\/em><\/strong> and <strong><span style=\"color: #800000;\"><em>Memory Type Range Registers (MTRRs)<\/em><\/span><\/strong> operate hierarchically for restricting caching. That is, if the <strong>CD flag<\/strong> of control <strong>register 0<\/strong> (CR0) is set, caching is prevented globally. If the <strong>CD flag<\/strong> is clear, the page-level <strong>cache control flags<\/strong> and\/or the <strong>MTRRs<\/strong> can be used to restrict caching.<\/p>\n<div style=\"border: 1px solid red; padding: 25px; margin: 25px;\"><strong>Tip:<\/strong> The memory type range registers <strong>(MTRRs)<\/strong> provide a mechanism for associating the memory types with physical-address ranges in system memory. They allow the processor to optimize operations for different types of memory such as RAM, ROM, frame-buffer memory, and memory-mapped I\/O devices. 
They also simplify system hardware design by eliminating the memory control pins used for this function on earlier IA-32 processors and the external logic needed to drive them.<\/div>\n<p>If there is an overlap of <strong>page-level<\/strong> and <strong>MTRR caching controls<\/strong>, the mechanism that prevents caching has precedence. For example, if an <strong>MTRR<\/strong> makes a region of system memory <strong>uncacheable<\/strong>, a page-level caching control <strong><em><span style=\"color: #800000;\">cannot be used<\/span><\/em><\/strong> to enable caching for a page in that region. The converse is also true; that is, if a page-level caching control designates a page as uncacheable, an MTRR cannot be used to make the page cacheable.<\/p>\n<p>In cases where there is an overlap in the assignment of the <strong>write-back<\/strong> and <strong>write-through<\/strong> caching policies to a page and a region of memory, the<strong><em><span style=\"color: #800000;\"> write-through policy takes precedence<\/span><\/em><\/strong>. The write-combining policy &#8211;which can only be assigned through an <strong>MTRR<\/strong> or Page Attribute Table <strong>(PAT)<\/strong>&#8211; takes precedence over either write-through or write-back. The selection of memory types at the page level varies depending on whether PAT is being used to select memory types for pages.<\/p>\n<div style=\"border: 1px solid red; padding: 25px; margin: 25px;\"><strong>Tip:<\/strong> The Page Attribute Table <strong>(PAT)<\/strong> extends the IA-32 architecture\u2019s page-table format to allow memory types to be assigned to regions of physical memory based on linear address mappings. The PAT is a companion feature to the MTRRs; that is, the MTRRs allow mapping of memory types to regions of the physical address space, while the PAT allows mapping of memory types to pages within the linear address space. 
The MTRRs are useful for statically describing memory types for physical ranges, and are typically set up by the system BIOS. The PAT extends the functions of the <strong>PCD<\/strong> and <strong>PWT bits<\/strong> in page tables to allow all five of the memory types that can be assigned with the MTRRs (plus one additional memory type) to also be assigned dynamically to pages of the linear address space.<\/div>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>CPU Control Registers<\/strong><\/p>\n<p>Generally speaking, control registers (<strong>CR0<\/strong>, <strong>CR1<\/strong>, <strong>CR2<\/strong>, <strong>CR3<\/strong>, and <strong>CR4<\/strong>) determine the operating mode of the processor and the characteristics of the currently executing task. These registers are 32 bits in all 32-bit modes and compatibility mode. 
In 64-bit mode, control registers are expanded to 64 bits.<\/p>\n<p><a href=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters_Detailed.jpg\" data-rel=\"lightbox-image-1\" data-rl_title=\"CPU Control Registers\" data-rl_caption=\"\" title=\"CPU Control Registers\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3139\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters_Detailed.jpg\" alt=\"\" width=\"1547\" height=\"1289\" srcset=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters_Detailed.jpg 1547w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters_Detailed-300x250.jpg 300w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters_Detailed-768x640.jpg 768w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters_Detailed-1024x853.jpg 1024w\" sizes=\"auto, (max-width: 1547px) 100vw, 1547px\" \/><\/a><\/p>\n<p>The <strong>MOV CRn<\/strong> instructions are used to manipulate the register bits. 
These instructions can be executed only when the current privilege level is 0.<\/p>\n<table style=\"width: 100%;\">\n<tbody>\n<tr>\n<th style=\"text-align: center; vertical-align: middle;\">Instruction<\/th>\n<th style=\"text-align: center; vertical-align: middle;\">64-bit Mode<\/th>\n<th style=\"text-align: center; vertical-align: middle;\">Legacy Mode<\/th>\n<th style=\"text-align: center; vertical-align: middle;\">Description<\/th>\n<\/tr>\n<tr>\n<td style=\"vertical-align: middle;\" width=\"33%\">MOV r32, CR0\u2013CR7<\/td>\n<td style=\"text-align: center; vertical-align: middle;\" width=\"15%\">&#8211;<\/td>\n<td style=\"text-align: center; vertical-align: middle;\" width=\"15%\">Valid<\/td>\n<td style=\"text-align: center;\">Move control register to r32.<\/td>\n<\/tr>\n<tr>\n<td style=\"vertical-align: middle;\">MOV r64, CR0\u2013CR7<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">Valid<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">&#8211;<\/td>\n<td style=\"text-align: center;\">Move extended control register to r64.<\/td>\n<\/tr>\n<tr>\n<td style=\"vertical-align: middle;\">MOV r64, CR8<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">Valid<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">&#8211;<\/td>\n<td style=\"text-align: center;\">Move extended CR8 to r64.<\/td>\n<\/tr>\n<tr>\n<td style=\"vertical-align: middle;\">MOV CR0\u2013CR7, r32<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">&#8211;<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">Valid<\/td>\n<td style=\"text-align: center;\">Move r32 to control register.<\/td>\n<\/tr>\n<tr>\n<td style=\"vertical-align: middle;\">MOV CR0\u2013CR7, r64<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">Valid<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">&#8211;<\/td>\n<td style=\"text-align: center;\">Move r64 to extended control register.<\/td>\n<\/tr>\n<tr>\n<td 
style=\"vertical-align: middle;\">MOV CR8, r64<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">Valid<\/td>\n<td style=\"text-align: center; vertical-align: middle;\">&#8211;<\/td>\n<td style=\"text-align: center;\">Move r64 to extended CR8.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div style=\"border: 1px solid red; padding: 25px; margin: 25px;\"><strong>Tip:<\/strong> When loading control registers, programs should not attempt to change the <strong>reserved bits<\/strong>; that is, always set reserved bits to the value previously read. An attempt to change CR4&#8217;s reserved bits will cause a general protection fault. Reserved bits in CR0 and CR3 remain clear after any load of those registers; attempts to set them have no impact.<\/div>\n<p>The Intel 64 and IA-32 architectures provide the following cache-control registers and bits for use in enabling or restricting caching to various pages or regions in memory:<\/p>\n<p><a href=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters.jpg\" data-rel=\"lightbox-image-2\" data-rl_title=\"CPU Cache Control Flags and Memory Type Range Registers\" data-rl_caption=\"\" title=\"CPU Cache Control Flags and Memory Type Range Registers\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3104\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters.jpg\" alt=\"\" width=\"1455\" height=\"1615\" srcset=\"http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters.jpg 1455w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters-270x300.jpg 270w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters-768x852.jpg 768w, http:\/\/mertboru.com\/wp-content\/uploads\/2018\/11\/CPUCache_ControlRegisters-923x1024.jpg 923w\" sizes=\"auto, (max-width: 1455px) 100vw, 1455px\" \/><\/a><\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" 
class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>CD flag (bit 30 of control register CR0):<\/strong> Controls caching of system memory locations. If the CD flag is clear, caching is enabled for the whole of system memory, but may be restricted for individual pages or regions of memory by other cache-control mechanisms. When the CD flag is set, caching is restricted in the processor\u2019s caches (cache hierarchy) for the P6 and more recent processor families. With the CD flag set, however, the caches will still respond to snoop traffic. For highest processor performance, both the CD and the NW flags in control register CR0 should be cleared. To ensure memory coherency after the CD flag is set, the caches should be explicitly flushed. (Setting the CD flag for the P6 and more recent processor families modifies cache line fill and update behaviour. Also, setting the CD flag on these processors does not force strict ordering of memory accesses unless the MTRRs are disabled and\/or all memory is referenced as uncached.)<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>NW flag (bit 29 of control register CR0):<\/strong> Controls the write policy for system memory locations. 
If the NW and CD flags are clear, write-back is enabled for the whole of system memory, but may be restricted for individual pages or regions of memory by other cache-control mechanisms.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>PCD and PWT flags (in paging-structure entries):<\/strong> Control the memory type used to access paging structures and pages.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>PCD and PWT flags (in control register CR3):<\/strong> Control the memory type used to access the first paging structure of the current paging-structure hierarchy.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>G (global) flag in the page-directory and page-table entries:<\/strong> Controls the flushing of TLB entries for individual pages.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>PGE (page global enable) flag in control register CR4:<\/strong> Enables the establishment of global pages with the G flag.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" 
alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>Memory type range registers (MTRRs):<\/strong> Control the type of caching used in specific regions of physical memory.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>Page Attribute Table (PAT) MSR:<\/strong> Extends the memory typing capabilities of the processor to permit memory types to be assigned on a page-by-page basis.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>3rd Level Cache Disable flag (bit 6 of IA32_MISC_ENABLE MSR):<\/strong> Allows the L3 cache to be disabled and enabled, independently of the L1 and L2 caches. (Available only in processors based on Intel NetBurst microarchitecture)<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>KEN# and WB\/WT# pins (Pentium processor):<\/strong> Allow external hardware to control the caching method used for specific areas of memory. 
They perform similar (but not identical) functions to the MTRRs in the P6 family processors.<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 <strong>PCD and PWT pins (Pentium processor):<\/strong> These pins (which are associated with the PCD and PWT flags in control register CR3 and in the page-directory and page-table entries) permit caching in an external L2 cache to be controlled on a page-by-page basis, consistent with the control exercised on the L1 cache of these processors. (The P6 and more recent processor families do not provide these pins because the L2 cache is embedded in the chip package.)<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>How to Manage CPU Cache using Assembly Language<\/strong><\/p>\n<p>The Intel 64 and IA-32 architectures provide several instructions for managing the L1, L2, and L3 caches. The <strong>INVD<\/strong> and <strong>WBINVD<\/strong> instructions are privileged instructions and operate on the L1, L2 and L3 caches as a whole. The <strong>PREFETCHh<\/strong>, <strong>CLFLUSH<\/strong> and <strong>CLFLUSHOPT<\/strong> instructions and the non-temporal move instructions (<strong>MOVNTI<\/strong>, <strong>MOVNTQ<\/strong>, <strong>MOVNTDQ<\/strong>, <strong>MOVNTPS<\/strong>, and <strong>MOVNTPD<\/strong>) offer more granular control over caching, and are available at all privilege levels.<\/p>\n<p>The <strong>INVD<\/strong> and <strong>WBINVD<\/strong> instructions are used to invalidate the contents of the L1, L2, and L3 caches. 
The <strong>INVD<\/strong> instruction invalidates all internal cache entries, then generates a special-function bus cycle that indicates that external caches also should be invalidated. The <strong>INVD<\/strong> instruction <em><span style=\"color: #800000;\"><strong>should be used with care<\/strong><\/span><\/em>. It does not force a write-back of modified cache lines; therefore, data stored in the caches and not written back to system memory will be lost. Unless there is a specific requirement or benefit to invalidating the caches without writing back the modified lines (such as during testing or fault recovery where cache coherency with main memory is not a concern), software should use the <strong>WBINVD<\/strong> instruction.<\/p>\n<p>In theory, the <strong>WBINVD<\/strong> instruction performs the following steps:<\/p>\n<pre style=\"padding-left: 30px;\">WriteBack(InternalCaches);\r\nFlush(InternalCaches);\r\nSignalWriteBack(ExternalCaches);\r\nSignalFlush(ExternalCaches);\r\nContinue;<\/pre>\n<p>The <strong>WBINVD<\/strong> instruction first writes back any modified lines in all the internal caches, then invalidates the contents of the L1, L2, and L3 caches. It ensures that cache coherency with main memory is maintained regardless of the write policy in effect (that is, write-through or write-back). Following this operation, the <strong>WBINVD<\/strong> instruction generates one (P6 family processors) or two (Pentium and Intel486 processors) special-function bus cycles to indicate to external cache controllers that write-back of modified data followed by invalidation of external caches should occur. The amount of time or cycles for <strong>WBINVD<\/strong> to complete will vary due to the size of different cache hierarchies and other factors. 
As a consequence, the use of the <strong>WBINVD<\/strong> instruction can have an impact on interrupt\/event response time.<\/p>\n<p>The <strong>PREFETCHh<\/strong> instructions allow a program to suggest to the processor that a cache line from a specified location in system memory be prefetched into the cache hierarchy.<\/p>\n<p>The <strong>CLFLUSH<\/strong> and <strong>CLFLUSHOPT<\/strong> instructions allow selected cache lines to be flushed from the cache hierarchy. These instructions give a program the ability to explicitly free up cache space when it is known that a cached section of system memory will not be accessed in the near future.<\/p>\n<p>The non-temporal move instructions (<strong>MOVNTI<\/strong>, <strong>MOVNTQ<\/strong>, <strong>MOVNTDQ<\/strong>, <strong>MOVNTPS<\/strong>, and <strong>MOVNTPD<\/strong>) allow data to be moved from the processor\u2019s registers directly into system memory without also being written into the L1, L2, and\/or L3 caches. These instructions can be used to prevent cache pollution when operating on data that is going to be modified only once before being stored back into system memory. These instructions operate on data in the general-purpose, MMX, and XMM registers.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>How to Disable Hardware Caching<\/strong><\/p>\n<p>To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills, perform the following steps:<\/p>\n<p style=\"padding-left: 30px;\"><strong>1.)<\/strong> Enter the no-fill cache mode. 
(Set the <strong>CD flag<\/strong> in control register <strong>CR0<\/strong> to 1 and the <strong>NW<\/strong> flag to 0.)<\/p>\n<p style=\"padding-left: 30px;\"><strong>2.)<\/strong> Flush all caches using the <strong>WBINVD<\/strong> instruction.<\/p>\n<p style=\"padding-left: 30px;\"><strong>3.)<\/strong> Disable the <strong>MTRRs<\/strong> and set the default memory type to uncached or set all <strong>MTRRs<\/strong> for the uncached memory type.<\/p>\n<p>The caches must be flushed (step 2) after the <strong>CD flag<\/strong> is set to ensure system memory coherency. If the caches are not flushed, cache hits on reads<span style=\"color: #800000;\"> <strong><em>will still occur<\/em><\/strong><\/span> and data will be read from valid cache lines.<br \/>\nThe three separate steps listed above address three distinct requirements:<\/p>\n<p style=\"padding-left: 30px;\"><strong>a.)<\/strong> Discontinue new data replacing existing data in the cache,<\/p>\n<p style=\"padding-left: 30px;\"><strong>b.)<\/strong> Ensure data already in the cache are evicted to memory,<\/p>\n<p style=\"padding-left: 30px;\"><strong>c.)<\/strong> Ensure subsequent memory references observe UC memory type semantics. Different processor implementations of cache-control hardware may allow some variation of software implementation of these three requirements.<\/p>\n<p>Setting the <strong>CD flag<\/strong> in control register <strong>CR0<\/strong> modifies the processor\u2019s caching behaviour as indicated, but setting the <strong>CD flag<\/strong> alone may not be sufficient across all processor families to force the effective memory type for all physical memory to be UC, nor does it force strict memory ordering, due to hardware implementation variations across different processor families. 
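Expressed as ring-0 style C with GCC inline assembly, the three steps might be sketched as follows. This is an illustrative sketch for kernel context only, not a drop-in routine: it faults at any privilege level other than 0, and the MSR index 0x2FF (IA32_MTRR_DEF_TYPE) with its enable bit (bit 11) is taken from the Intel SDM rather than from the original text.

```c
/* Ring-0 ONLY -- illustrative sketch of steps 1-3 above.
   CD is CR0 bit 30, NW is CR0 bit 29; MSR 0x2FF is IA32_MTRR_DEF_TYPE. */
#include <stdint.h>

static void disable_caches(void)
{
    uint64_t cr0;
    uint32_t lo, hi;

    /* 1.) Enter no-fill mode: CD = 1, NW = 0. */
    __asm__ volatile("mov %%cr0, %0" : "=r"(cr0));
    cr0 |=  (1ULL << 30);
    cr0 &= ~(1ULL << 29);
    __asm__ volatile("mov %0, %%cr0" :: "r"(cr0) : "memory");

    /* 2.) Write back and invalidate all caches. */
    __asm__ volatile("wbinvd" ::: "memory");

    /* 3.) Disable the MTRRs: clear the E bit in IA32_MTRR_DEF_TYPE
           so the default memory type becomes UC. */
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(0x2FF));
    lo &= ~(1U << 11);
    __asm__ volatile("wrmsr" :: "a"(lo), "d"(hi), "c"(0x2FF));
}
```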
To force the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either program the <strong>MTRRs<\/strong> for all physical memory to be UC memory type or disable all <strong>MTRRs<\/strong>.<\/p>\n<div style=\"border: 1px solid red; padding: 25px; margin: 25px;\"><strong>Tip:<\/strong> For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been executed, the cache lines containing the code executed between the end of the <strong>WBINVD<\/strong> instruction and the point at which the <strong>MTRRs<\/strong> have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second <strong>WBINVD<\/strong> instruction must be executed after the <strong>MTRRs<\/strong> have been disabled.<\/div>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-1658 aligncenter\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_endof.png\" alt=\"\" width=\"27\" height=\"27\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>References:<\/strong><\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 Richard Blum, <em>&#8220;Professional Assembly Language&#8221;<\/em>, Wrox Publishing &#8211; (2005)<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 Keith Cooper &amp; Linda Torczon, <em>&#8220;Engineering A Compiler&#8221;<\/em>, Morgan Kaufmann, 2nd Edition &#8211; (2011)<\/p>\n<p style=\"padding-left: 30px;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" 
src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 Alexey Lyashko, <em>&#8220;Mastering Assembly Programming&#8221;<\/em>, Packt Publishing Limited &#8211; (2017)<\/p>\n<p style=\"padding-left: 30px;\"><em><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 &#8220;Intel\u00ae 64 and IA-32 Architectures Optimization Reference Manual&#8221;<\/em> &#8211; (April 2018)<\/p>\n<p style=\"padding-left: 30px;\"><em><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 &#8220;Intel\u00ae 64 and IA-32 Architectures Software Developer\u2019s Manual: Basic Architecture&#8221;<\/em> &#8211; (November 2018)<\/p>\n<p style=\"padding-left: 30px;\"><em><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 &#8220;Intel\u00ae 64 and IA-32 Architectures Software Developer\u2019s Manual: Instruction Set Reference A-Z&#8221;<\/em> &#8211; (November 2018)<\/p>\n<p style=\"padding-left: 30px;\"><em><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 &#8220;Intel\u00ae 64 and IA-32 Architectures Software Developer\u2019s Manual: System Programming Guide&#8221;<\/em> &#8211; (November 2018)<\/p>\n<p style=\"padding-left: 30px;\"><em><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1650\" 
src=\"http:\/\/mertboru.com\/wp-content\/uploads\/2015\/03\/z.webbullet_dashed.png\" alt=\"\" width=\"11\" height=\"11\" \/>\u00a0 &#8220;Intel\u00ae 64 and IA-32 Architectures Software Developer\u2019s Manual: Model-Specific Registers&#8221;<\/em> &#8211; (November 2018)<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>(Cover Photo:\u00a0 \u00a9 Granger &#8211; &#8220;Lion Tamer&#8221; The American animal tamer Clyde Beatty performing in the 1930s.) The processor&#8217;s caches are for the most part transparent to software. When enabled, instructions and data flow through these caches without the need for explicit software control. However, knowledge of the behavior of these caches may be useful &hellip; <a href=\"http:\/\/mertboru.com\/?p=3073\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Taming a Beast: CPU Cache<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":3080,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26,17,7],"tags":[175,177,179,180,194,195,204,178,174,186,185,191,205,206,207,187,198,196,200,199,197,190,188,189,193,183,184,176,192,182,181],"class_list":["post-3073","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-articles","category-retrodemo","category-gamedev","tag-c-c","tag-cache","tag-cache-hit","tag-cache-line-fill","tag-clflush","tag-clflushopt","tag-const","tag-control-register","tag-cpu","tag-ia-32","tag-intel-64","tag-invd","tag-l1-cache","tag-l2-cache","tag-l3-cache","tag-memory-type-range-registers","tag-movntdq","tag-movnti","tag-movntpd","tag-movntps","tag-movntq","tag-mtrr","tag-page-attribute-table","tag-pat","tag-prefetchh","tag-snooping","tag-translation-lookaside-buffer","tag-volatile","tag-wbinvd","tag-write-allocation","tag-write-hit"],"_links":{"self":[{"href":"http:\/\/mertboru.com\/index.php?rest
_route=\/wp\/v2\/posts\/3073","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/mertboru.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/mertboru.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/mertboru.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/mertboru.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3073"}],"version-history":[{"count":1,"href":"http:\/\/mertboru.com\/index.php?rest_route=\/wp\/v2\/posts\/3073\/revisions"}],"predecessor-version":[{"id":3703,"href":"http:\/\/mertboru.com\/index.php?rest_route=\/wp\/v2\/posts\/3073\/revisions\/3703"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/mertboru.com\/index.php?rest_route=\/wp\/v2\/media\/3080"}],"wp:attachment":[{"href":"http:\/\/mertboru.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3073"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/mertboru.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3073"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/mertboru.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3073"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}