Octoparse review

I love doing research and compiling data into databases. Since childhood I have created numerous databases by hand, for example a database of car models and their production years, built by looking up each car model on Wikipedia and writing the details into my database. These are original databases with no equivalent on the internet!

I have no programming experience, and I was not aware that scraping was possible. In August 2015 I did several Google searches related to scraping and found Import.io. It changed my life: an easy-to-use, do-it-yourself tool that let me quickly create new databases for my personal research by scraping data from other websites, work that would take many hours if I copied the data manually. (Do note that copying other websites can get you into legal trouble, especially if you use their data commercially, such as for creating your own website.) With Import.io I would input a list of URLs and extract them in bulk at a rate of 1 page per second. The rate slowed down over time, so it was better to run batches taking at most 5-10 hours.

In April 2016 Import.io went through a major update: it removed the desktop application for new sign-ups and introduced cloud extraction, with free plans limited to 10,000 queries per month and paid plans starting at $249 per month for 50,000 queries per month. They assured me via email that people who had signed up prior to March 2016 could still use their software for free without limits.

However, at the end of August 2016 they sent me an email saying that I was exceeding the limits of the free plan, with over 90,000 queries in the last 30 days. They gave me 2 options: reduce my queries to a maximum of 10,000 per month, or upgrade to a paid plan. They also said that there were many “zombie accounts” like mine and that if I did not reply, they would suspend my account. I replied, but they did not reply to me. I continued to scrape websites with a large number of queries, because I NEEDED to, and on 10th September they sent me another email saying that they had suspended my account for continuous usage over the free tier limits.

While the account for the desktop application was suspended, I was still able to use their cloud extraction, limited to 10,000 queries per month. The query count was due to reset on the 12th of each month, but after that date I was no longer able to sign in, most likely because they had suspended my cloud account too. I tried to sign up with a new account, but found that they had limited free plans to 500 queries per month since 14th or 15th September.

This is how I realized how risky it is to run a business based on a third-party service that changes its prices arbitrarily. I had been able to earn some money by offering web scraping services to various customers, but not enough to afford the $249 plan for 50,000 queries, and my customers were giving me websites with more than 100,000 pages to scrape.

I posted my story on Stack Overflow, and someone who answered gave me a list of 10+ do-it-yourself scraping tools. I had not been aware that there were so many alternatives to Import.io. I started testing each one. All of them, including Octoparse, were harder to use than Import.io; some of them cost money, but none was as expensive as Import.io. I decided to pay more attention to Octoparse and learn how to use it, because it is FREE software.

After days of learning, Octoparse allowed me to create new databases by scraping websites that require the user to click buttons or enter text, which was not possible with Import.io. However, I also had projects done with Import.io that cannot be done with Octoparse.

The GOOD things of Octoparse

  • A FREE plan that has no limit on the number of pages or records to be scraped.
  • PAID plans are quite cheap and, unlike Import.io, also have no limit on the number of pages.
  • Ability to scrape websites that require the user to input data, click buttons, go through pagination, handle infinite scrolling, etc.

The BAD things of Octoparse

  • Extremely slow speed. This browser-based scraping never goes faster than 3-4 seconds per page, and on certain websites it took 1 minute per page. I often get projects with 10,000+ or even 100,000+ pages, which take a few days to scrape on the free plan. That is too long, so I would rather not use Octoparse for them.
  • No ability to scrape multiple XPath matches per page.
  • Hard to use. It is very complex software with a lot of options. It offers several tutorials that are helpful for learning how to scrape simple websites, but for certain websites I did not succeed in building scraping tasks without asking their staff to do it for me. Editing an XPath or doing other simple tasks takes a lot more clicks and time than in other scraping software.
  • Lots of bugs. One of them is that when I add a blank field and then edit its XPath, it does not extract anything; I need to add a field by clicking a random thing on the page and then edit the XPath. It also stops at random times, especially during pagination: it clicks the next page a few times and then stops, so I need to run it again and again until it scrapes all the results.
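To put the speed complaint in numbers, here is a rough back-of-the-envelope sketch in Python. It assumes pages are scraped one after another at a constant rate, which is a simplification; the function name `scrape_days` and the 3.5 second midpoint are illustrative assumptions, while the other per-page times are the figures quoted in this review.

```python
def scrape_days(pages: int, seconds_per_page: float) -> float:
    """Days needed to scrape `pages` pages one after another."""
    return pages * seconds_per_page / 86_400  # 86,400 seconds in a day


# Per-page times taken from the figures quoted in this review;
# 3.5 s is an assumed midpoint of the "3-4 seconds" range.
RATES = {
    "Import.io, 1 page per second": 1.0,
    "Octoparse, typical": 3.5,
    "Octoparse, slow website": 60.0,
}

for pages in (10_000, 100_000):
    for label, seconds in RATES.items():
        print(f"{pages:>7,} pages, {label}: {scrape_days(pages, seconds):.1f} days")
```

Under these assumptions a 100,000-page project at 3.5 seconds per page needs about 4 days of uninterrupted running, which matches the "few days" experienced in practice.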

In 2015, soon after discovering Import.io, I also became friends with a student programmer. I paid him 3 times to make custom scraping applications in Visual Basic for websites too complex for Import.io. In September 2016, on hearing the story of my Import.io account suspension, he got the idea to build a universal scraping program himself in his spare time, and he dreams that some day he will take users away from Import.io. After one month and dozens of hours of coding during weekends, this student made software that does the same job as Import.io: it takes a list of URLs as input and bulk extracts the matching XPath, at a speed varying from 1 to 10 pages per second. That is much faster than Octoparse, but it lacks features such as pagination, infinite scrolling, clicking buttons, entering text, etc. His software is in BETA and is not available for purchase at this moment; I am his only user and I am paying for it.
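For readers curious what "input a list of URLs and bulk extract the matching XPath" means in practice, here is a minimal Python sketch of the idea. It is not my programmer's software, whose code I cannot show. Every name in it (`extract_matches`, `bulk_extract`, `SAMPLE_PAGES`, the example URLs) is hypothetical, the pages are canned strings instead of real downloads, and it assumes well-formed markup because Python's built-in `xml.etree.ElementTree` understands only a limited subset of XPath; real-world HTML would likely need a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List


def extract_matches(markup: str, xpath: str) -> List[str]:
    """Return the text of every element in `markup` that matches `xpath`."""
    root = ET.fromstring(markup)  # assumes well-formed markup
    return ["".join(node.itertext()).strip() for node in root.findall(xpath)]


def bulk_extract(
    urls: Iterable[str], xpath: str, fetch: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Apply the same XPath to every URL; `fetch` returns the page source."""
    return {url: extract_matches(fetch(url), xpath) for url in urls}


# Canned pages stand in for real downloads (hypothetical URLs and content).
SAMPLE_PAGES = {
    "http://example.com/car/1": (
        "<html><body><h1>Model A</h1>"
        "<span class='year'>1998</span><span class='year'>2004</span>"
        "</body></html>"
    ),
    "http://example.com/car/2": (
        "<html><body><h1>Model B</h1><span class='year'>2011</span></body></html>"
    ),
}

results = bulk_extract(SAMPLE_PAGES, ".//span[@class='year']", SAMPLE_PAGES.__getitem__)
for url, years in results.items():
    print(url, years)
```

Passing the download step in as `fetch` keeps the sketch runnable offline; a real tool would replace it with an HTTP request and would also need error handling and polite delays between requests.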

Currently I use both Octoparse and my programmer's software, often for the same project: Octoparse goes through lists and search pages, buttons and pagination to collect the URLs of products, and then I input those URLs into my programmer's software to extract the product details.