{"id":893,"date":"2015-03-23T22:13:29","date_gmt":"2015-03-23T20:13:29","guid":{"rendered":"https:\/\/0x0a.li\/?p=893"},"modified":"2023-04-04T12:39:05","modified_gmt":"2023-04-04T10:39:05","slug":"how-to-i-webscraping-mit-kimono","status":"publish","type":"post","link":"https:\/\/0x0a.li\/en\/how-to-i-webscraping-mit-kimono\/","title":{"rendered":"How to: Web scraping with Kimono"},"content":{"rendered":"<p><\/p>\n<div class=\"teaser-text\">\n<p>Many works on 0x0a have\u00a0as their starting point gigantic collections\u00a0of texts, called corpora. You can easily compile such\u00a0a corpus by yourself via the method of &#8220;web scraping.&#8221; For the 0x0a text &#8220;Chicken Infinite&#8221; \u2013 a 532-page recipe \u2013 the\u00a0text\u00a0corpus\u00a0consisted of\u00a0cooking instructions gathered from the internet. The tool for doing this was Kimono, a\u00a0web scraper I would like to introduce today.<\/p>\n<\/div>\n<p>Kimono by\u00a0<a href=\"https:\/\/www.kimonolabs.com\/\" target=\"_blank\" rel=\"noopener\">kimonolabs<\/a>\u00a0offers a user-friendly method for grabbing the contents of websites in a structured way. Without needing to know anything about programming, you can compile enormous collections of texts. This service\u00a0is especially useful when websites do not offer APIs, that is, interfaces with which one can gather data directly (as does Twitter). 
Instead, you build your own API with Kimono, and get as output a JSON, RSS, or CSV file that contains the text from the particular website.<\/p>\n<p><!--more--><\/p>\n<h3>Bookmarklet\u00a0or Chrome plugin<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-906\" style=\"margin: 0.5em 0.9em .2em 0;\" src=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_logo-150x150.png\" alt=\"\" width=\"100\" height=\"100\" srcset=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_logo-150x150.png 150w, https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_logo.png 256w\" sizes=\"(max-width: 100px) 100vw, 100px\" \/>In order to be able to use Kimono, you need to create a username and log in. Additionally, you need to install a browser\u00a0<a href=\"https:\/\/www.kimonolabs.com\/learn\/getstarted\" target=\"_blank\" rel=\"noopener\">bookmarklet<\/a>\u00a0or use the\u00a0<a href=\"https:\/\/help.kimonolabs.com\/hc\/en-us\/articles\/203339070-Install-the-kimono-chrome-extension\" target=\"_blank\" rel=\"noopener\">Google Chrome plugin<\/a>\u00a0for Kimono. After clicking the bookmarklet or the plugin button, you see a cached version of the website and\u00a0a layer with the Kimono interface.<\/p>\n<h3>Creating a\u00a0list of links<\/h3>\n<p>All elements the web scraper can grab\u00a0are now clickable. 
If you choose a\u00a0recipe from a list (like in the image below), Kimono automatically recognizes all other recipe links in the same column and asks whether it should select them as well.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-928 size-full\" src=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_11.png\" alt=\"\" width=\"700\" height=\"394\" srcset=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_11.png 700w, https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_11-300x169.png 300w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/>Once you have selected\u00a0everything you need, you give the data field (the output, which Kimono calls a property) a name: <em>Recipe_Link<\/em>. Kimono saves not only the name of the recipe (the HTML block) but also the URL (the\u00a0link to the actual recipe in the &#8220;href&#8221; attribute of the block).<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-932 size-full\" src=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_2.png\" alt=\"\" width=\"700\" height=\"219\" srcset=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_2.png 700w, https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_2-300x94.png 300w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>You can repeat this\u00a0step as often as you like. This way, you can save not only the link to the recipe but also, say, the link to the associated image, as long as it is part of the same HTML structure. 
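Under the hood, the <em>Recipe_Link<\/em> property amounts to collecting the href attribute of every selected link. A minimal sketch of that step in plain Python, using only the standard library; the HTML snippet is invented for illustration:

```python
from html.parser import HTMLParser

# Invented stand-in for a recipe listing page; in practice this string
# would be the downloaded HTML of the page you point Kimono at.
PAGE = """
<ul class="recipes">
  <li><a href="/recipe/roast-chicken">Roast Chicken</a></li>
  <li><a href="/recipe/chicken-soup">Chicken Soup</a></li>
</ul>
"""

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)  # ['/recipe/roast-chicken', '/recipe/chicken-soup']
```

Kimono does the equivalent selection visually, without any code.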
For now, let&#8217;s just compile a list\u00a0of links to the recipes.<\/p>\n<h3>Pagination<\/h3>\n<p>Once you have selected the elements of a website you want to save, Kimono next offers a neat function that extends the selection to the following pages by finding the &#8220;next page&#8221; link.\u00a0You do this by first clicking the pagination button (the book icon) and then clicking the link to the next page.\u00a0As soon as Kimono knows how to get to the next page, it automatically\u00a0finds its way through the myriad recipes that follow. This way, you can scrape as many pages as you like.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-930 size-full\" src=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_31.png\" alt=\"\" width=\"700\" height=\"428\" srcset=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_31.png 700w, https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_31-300x183.png 300w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>It\u00a0can even deal with infinite scroll systems in which the content is reloaded via JavaScript at the bottom of the page once you scroll down (as on Tumblr).<br \/>\nThe pagination function is the key to assembling large data sets.<\/p>\n<h3>Scraping\u00a0after a login<\/h3>\n<p>Should you need to log into a page before you can scrape it, Kimono can switch to what it calls\u00a0<em>Auth mode<\/em>. You need to allow the Chrome plugin to operate in incognito mode. Kimono will save the user name and password and log in automatically.<\/p>\n<h3>Saving the API<\/h3>\n<p>As a last\u00a0step, you will reach a second screen that lets you configure the details of the web scraper. Here, you can adjust\u00a0the CSS path or the regular expression for each field you want to save (if you want to scrape more complex data structures). In most cases, you can leave everything as it is. 
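To get a feel for what that per-field regular expression does, here is a toy example in Python: the CSS path selects a block of text, and the pattern trims it down to the part you actually want. The field text and pattern are invented for illustration.

```python
import re

# A scraped text field as Kimono might deliver it.
field_text = "Preparation time: 45 min"

# The regular expression keeps only the number of minutes.
match = re.search(r"(\d+)\s*min", field_text)
minutes = int(match.group(1)) if match else None
print(minutes)  # 45
```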
Finally, you will see a preview of the output in either JSON, CSV, or RSS format. Click\u00a0<em>Done<\/em>, give the\u00a0API a name, and choose a time interval for the scraping process. Our API is called <em>recipes<\/em>, and we&#8217;ll scrape only once.<\/p>\n<h3>Combining APIs<\/h3>\n<p>Kimono can combine APIs.\u00a0Our first API collected links, but we can build a second one that resolves every individual recipe link, scans the fields in the target URL, and downloads the actual recipe. We only need to tell Kimono which\u00a0fields to save.\u00a0Start from\u00a0any recipe page that includes an\u00a0<i>ingredients<\/i> and a\u00a0<em>recipe<\/em> field, and select both fields for extraction. This second API we call\u00a0<em>Recipe_Details<\/em>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-948\" src=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_4.png\" alt=\"\" width=\"700\" height=\"285\" srcset=\"https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_4.png 700w, https:\/\/0x0a.li\/wp-content\/uploads\/2015\/03\/kimono_4-300x122.png 300w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>After creating the second API, you\u00a0can combine it with the first. You\u00a0do this by logging in to Kimono and going to the\u00a0<em>My APIs<\/em> section, which shows a list of all the APIs you have built. You click\u00a0your latest\u00a0API (<em>Recipe_Details<\/em>), navigate to\u00a0<em>Crawl Setup<\/em>, and select\u00a0your first API (<em>Recipe_Link<\/em>) in the field\u00a0<em>Source API<\/em>. After clicking\u00a0<em>Start Crawl<\/em>, Kimono will download all the recipes you\u00a0have identified with your first API. Clicking on the tab\u00a0<em>Data Field<\/em>\u00a0shows you a preview of the data collection in JSON, RSS, or CSV format.<\/p>\n<p>Now, you have a corpus of thousands of recipes. 
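As a small taste of what such post-processing can look like, here is a word-frequency count over the corpus in a few lines of Python. The two recipe texts are an invented stand-in for the real scraped data:

```python
from collections import Counter

# Two invented recipe texts standing in for the scraped corpus;
# in practice these would come out of the Recipe_Details API.
corpus = [
    "Roast the chicken with thyme and salt.",
    "Simmer the chicken with leek and noodles.",
]

# A first simple transformation: a word-frequency table across the
# corpus, the kind of statistic a concordance tool would produce.
words = Counter(
    word.strip(".,").lower()
    for text in corpus
    for word in text.split()
)
print(words["chicken"])  # 2
```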
You can continue to process it further \u2013 sorting, rearranging, parsing, or transforming your corpus according to\u00a0certain rules, for instance with a Python script or with concordance software. We will show you how to do this another time.<\/p>","protected":false},"excerpt":{"rendered":"<p>Many works on 0x0a have\u00a0as their starting point gigantic collections of texts, called corpora. You can easily compile such a corpus by yourself via the method of &#8220;web scraping&#8221; [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[75],"tags":[],"acf":[],"_links":{"self":[{"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/posts\/893"}],"collection":[{"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/comments?post=893"}],"version-history":[{"count":88,"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/posts\/893\/revisions"}],"predecessor-version":[{"id":1686,"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/posts\/893\/revisions\/1686"}],"wp:attachment":[{"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/media?parent=893"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/categories?post=893"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/0x0a.li\/en\/wp-json\/wp\/v2\/tags?post=893"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}