I love doing research and compiling data in databases. Since childhood I have created numerous databases manually, for example a database of car models and their production years, built by browsing Wikipedia for each car model and typing the details into my databases. These are original databases with no equivalent on the internet!
I do not have programming experience, and I was not aware that scraping was possible. In August 2015 I did several Google searches related to scraping and found Import.io. It changed my life: an easy-to-use do-it-yourself tool that allowed me to quickly create new databases for my personal research by scraping data from other websites, work that would have taken many hours of copying data manually. (Do note that copying other websites can get you into legal trouble, especially if you use their data commercially, such as for creating your own website.)
Import.io was free software with no limits, funded by customers who hired Import.io staff to do the scraping for them. I was inputting a list of URLs and extracting them in bulk at a rate of 1 page per second, but it slowed down over time, so it was better to run batches taking at most 5-10 hours.
In April 2016 Import.io went through a major update, removing the desktop application for new sign-ups and introducing cloud extraction, with free plans limited to 10,000 queries per month and paid plans starting from $249 per month for 50,000 queries. They assured me via email that people who had signed up prior to March 2016 could still use their software for free without limits.
However, at the end of August 2016 they sent me an email saying that I was exceeding the limits of the free plan, with over 90,000 queries in the last 30 days, and gave me 2 options: reduce the number of queries to a maximum of 10,000 per month, or upgrade to a paid plan. They also said that there were many “zombie accounts” like mine and that if I didn’t reply, they would suspend my account. I replied, but they did not reply back. I continued to scrape websites with large numbers of queries, because I had projects on my TO-DO list, and on 10th September they sent me another email saying that they had suspended my account for continuous usage over the free tier limits.
While the account for the desktop application was suspended, I was still able to use their cloud extraction, but limited to 10,000 queries per month. The number of queries was going to be reset on the 12th of each month, but after this date I was no longer able to sign in, most likely because they had suspended my cloud account too. I tried to sign up with a new account, but found that they had just limited free plans to 500 queries per month, since 14th or 15th September.
This is how I realized how risky it is to run a business based on a third-party service that changes its prices arbitrarily. I had been able to earn some money by offering web scraping services to various customers, but not enough to afford the $249 plan for 50,000 queries, and my customers were asking me to scrape websites with more than 100,000 pages.
I posted my story on Stack Overflow, and someone who answered gave me a list of 10+ do-it-yourself scraping tools. I was not aware that there were so many alternatives to Import.io. I started testing each one. All of them, including Octoparse, were harder to use than Import.io; some of them cost money, but none was as expensive as Import.io. I decided to pay more attention to Octoparse and learn how to use it, because it is FREE software.
After days of learning, Octoparse allowed me to create new databases by scraping websites that require the user to click buttons or enter text, which was not possible with Import.io. However, I also had projects done with Import.io that cannot be done with Octoparse.
The GOOD things about Octoparse
- FREE plan that does not have any limit on the number of pages or records to be scraped.
- PAID plans are quite cheap and, unlike Import.io’s, also have no limit on the number of pages.
- Ability to scrape websites that require the user to input data, click buttons, go through pagination, handle infinite scrolling, etc.
- Octoparse User Club on Facebook where you can ask for help from other users and even staff members.
The BAD things about Octoparse
- Extremely slow speed. This browser-based scraping never goes faster than 3-4 seconds per page, and on certain websites it was taking 1 minute per page. I often get projects to extract 10,000+ or even 100,000+ pages, which would take a few days to scrape on the free plan. That is too much, so I would rather not use Octoparse for them.
- No ability to scrape multiple XPath matches per page.
- Hard to use. It is very complex software with a lot of options. It offers several tutorials that are helpful for learning how to scrape simple websites, but for certain websites I did not succeed in making scraping tasks without asking their staff to do it for me. Editing XPath or doing other simple tasks takes a lot more clicks and time than in other scraping software.
- Lots of bugs. One of them is that when I add a blank field and then edit its XPath, it does not extract anything; I need to add a field by clicking a random thing on the page and then edit the XPath. It also stops at random times, especially in pagination: it clicks the next page a few times and then stops, so I need to run it again and again until it scrapes all results. Another bug: when extracting a list or table, clicking the fields to extract makes it extract the whole list, but editing the XPath makes it extract only the 1st item in the list.
In 2015, soon after discovering Import.io, I also became friends with a student programmer from Pakistan. I paid him 3 times to make custom scraping applications in Visual Basic for websites too complex for Import.io. In September 2016, hearing my story about the Import.io account suspension, he got the idea to make a universal scraping software himself in his spare time, and he dreams that some day he will take users away from Import.io. After one month and dozens of hours of coding during weekends, this student made “Simple Web Scraper”, which does the same job as Import.io: input a list of URLs and bulk extract the matching XPath, at a speed varying from 1 to 10 pages per second. It is much faster than Octoparse, but lacks features such as pagination, infinite scrolling, clicking buttons, inputting text, etc.
Simple Web Scraper is in BETA and is not available for purchase at this moment. I am his only user and I am paying for it, but he is planning to release it commercially in the future.
Currently I am using both Octoparse and Simple Web Scraper, often for the same project: Octoparse to go through lists and search pages, buttons and pagination to get the URLs of products, then Simple Web Scraper, fed with those URLs, to extract the product details.
NEWS: Octoparse IDIOT staff locked my account!
I wrote the above article in March 2017, when Octoparse offered a 1-month FREE professional plan to anyone who reviewed their software, even critically. They said that my article was OK and that I was eligible for the professional plan. I also told them that I was making money using Octoparse and could pay for the professional plan if I had big projects and needed more power than the free plan offers, but that I was not using Octoparse on a regular basis and did not have any immediate project, so I asked them to keep my offer for later. They said that whenever I wanted the 1-month FREE professional plan to start, I should contact them.
In May 2017, one of my customers put pressure on me to scrape a car tire website with a Load More button, which Simple Web Scraper cannot handle. I tried to build an extractor but didn’t manage to click Load More, so I asked for help in the Octoparse Users Club. Another user made an .otd file that clicks the Load More button and extracts the list of product names, but when I tried to edit the XPath to extract the product URL instead of the product name, I ran into the BIG fucking bug that causes extraction of the 1st item on every row. I said in the Users Club that their software was bugged and made my project impossible, and asked them how to solve this problem. The next day I found myself kicked out of the Users Club. I tried to join again and they banned me without any explanation.
A few days later my Octoparse account (Future) was also locked. I quickly created a new account (Teoalida), because I needed to update one of my databases by re-scraping a website, which I had been doing happily every 2 months with Octoparse.
By terrible coincidence, 2 weeks later, within just one week I was contacted by 2 customers from Australia asking whether I had or could create a car database for their country. This was one of the potential future projects I was planning to do if I found an interested customer. The Australian car website has anti-scraping features that blocked Simple Web Scraper, which was taking 30 seconds to scrape a page and missing 60% of pages. Octoparse, surprisingly, was working at a rate of 12 seconds per page. With 92,000 cars on the website, it would have taken about 300 hours to create the database, which means keeping the computer running 15 hours daily for 20 days. The FREE 1-month professional plan would have been handy at such a moment. I emailed them again asking for the promised reward for writing this article on my website, but they ignored me. I managed to get in touch with their staff via the live chat on their official website.
See the chat:
Hi! Have a look around! Let us know if you have any questions.
I sent several emails and haven’t heard any reply
Could you please let me know what you email is?
I am long-time user… 9 months
What’s your trouble?
in 9 months I successfully done several project with your software but I also got frustrated when your software FAILED to do my projects. And now my account is locked… why?
How I can update regularly my projects if you suspended my account?
It’s a free software. Like it or not…
it’s free but extremely slow so I was thinking to try professional plan
No you don’t..
I don’t… what?
If you hate it, don’t use it. Thank you.
right now I got a project to scrap 92000+ pages which will take 20 days in free plan
You’ve been bitching it about 9 months…
love or hate matter less, your software is the ONLY way to scrap this particular website
You have a awesome partner to get what you want.
LOL, you get angry because I have a friend who made himself a scraper?
I DON’T get what do I want with his scraper. For several projects I use Octoparse and his scraper in parallel
1h ago. Seen
In conclusion: the Octoparse staff HATE ME, a loyal 9-month and potentially paying customer, because I raged on Facebook about bugs and asked for a solution. They do not understand that I have also done successful projects with Octoparse.
Given the project size and price, I really wanted to PAY for the Octoparse professional plan, but the staff attitude changed my mind. Worried that they could lock my Teoalida account, I created a second account using a secret email and applied for the 5-day trial of the professional plan. Surprisingly, they awarded the 5-day trial to the Teoalida account. I tried cloud extraction, but for the Australian car website it was even slower than local extraction. I tested 3 other projects and they were about 2-3 times faster in the cloud. But the professional plan allowed me to run more than 2 tasks locally at the same time. I started 8 local tasks simultaneously to scrape the Australian car website faster (each task having a number of URLs that would have taken 5 hours to scrape), but 7 of them crashed at random moments within a few hours. I tried again, starting 4 tasks simultaneously, and they crashed as well. I had to run only 2 tasks simultaneously, and even this was not crash-proof, so I got the idea to install Octoparse on a virtual machine and run another 2 tasks at the same time. This allowed me to create the 92,000-car database in about 9 days and get paid for it.
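The time estimate for the Australian car website is simple arithmetic, and it can be sketched in a few lines of Python. The function name and the `parallel_tasks` parameter are made up for illustration, and the calculation assumes tasks never crash and share the work evenly, which was not the case in practice.

```python
def scrape_hours(pages, seconds_per_page, parallel_tasks=1):
    """Estimated wall-clock hours for a scraping job.

    Idealized assumption: all tasks run at the same speed, never crash,
    and split the pages evenly between them.
    """
    return pages * seconds_per_page / 3600 / parallel_tasks


# 92,000 pages at the 12 seconds per page observed with Octoparse, one task
hours = scrape_hours(92_000, 12)
# Days needed if the computer runs 15 hours per day
days = hours / 15
print(f"{hours:.0f} hours, about {days:.0f} days")
```

This gives a little over 300 hours for a single task, in line with the estimate above. Calling `scrape_hours(92_000, 12, parallel_tasks=4)` shows the best case for 4 tasks running side by side; real runs take longer when tasks crash and have to be restarted.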
Update: I managed to scrape the car tire website after my student friend told me to use Inspect element > Network and copy the URL that was being loaded by the Load More button. This allowed me to get the URLs of the pages displaying all products and scrape them with Simple Web Scraper (or with List or Table extraction in Octoparse). If I had known this at the right time, I would not have asked for help in the Octoparse User Club and this horror story would never have happened. A fucking customer destroyed my good friendship with the Octoparse staff.
Personally I am still using Octoparse, on the free plan, with happiness when extracting complex but small websites, but also with frustration when trying to figure out how to extract certain websites without access to the Octoparse User Club on Facebook to ask for help. At the same time I use Simple Web Scraper (made by the Pakistani student specially for my needs) for simple projects. I sometimes use both of them for the same project: Octoparse to go through lists and search pages, buttons and pagination to get the URLs of products, then Simple Web Scraper, fed with those URLs, to extract the product details faster.
But since Simple Web Scraper will be paid software, Octoparse remains the ONLY free scraping software I know of with no limit on the number of URLs to extract. Hard to use, bugged, prone to crashes, but FREE and useful (only) for small data extraction jobs.
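For readers who want to try the Network tab trick: the URL copied there often carries a page or offset number in its query string, and changing that number gives the URL of every page of products. Below is a minimal Python sketch of building such a list of URLs. The example URL, the `page` parameter name and the function name are all hypothetical; the real ones depend on the website, so check what the copied URL actually looks like.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_urls(copied_url, param, values):
    """Return one URL per value, with query parameter `param` set to that value.

    `copied_url` is the request URL copied from the browser's Network tab.
    Other query parameters are kept as they are.
    """
    parts = urlsplit(copied_url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    urls = []
    for value in values:
        query[param] = str(value)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


# Hypothetical URL of the request triggered by a Load More button
copied = "https://example.com/tires/list?brand=all&page=1"
for url in page_urls(copied, "page", range(1, 4)):
    print(url)
```

The resulting list can then be pasted into any tool that accepts a list of URLs.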
Whatever the circumstances, DO NOT PAY Octoparse!
Do not feed the idiots who locked my account after 9 months of being a loyal customer!
The professional plan is barely useful: it does not fix bugs or make Octoparse less prone to crashes, and the 3x faster extraction is not a big deal. If you have money to spend and need a large extraction, I suggest buying other scraping software that is faster and does not crash. Or contact me and I will offer you a web scraping service with Simple Web Scraper or even a custom app.
Although Octoparse allows you to extract “List and Detail”, I do not recommend using it. Instead I recommend making one task for the list of products that extracts the URLs only, and a second task using “List of URLs” where you input the URLs extracted by the first task. That way, if the software crashes, you can start again with only the URLs not yet extracted.
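Finding the URLs that still need to be extracted after a crash can be done by hand in a spreadsheet, or with a few lines of Python. This is only a sketch: it assumes the second task exports one row per page with a column holding the page URL (called `url` here, a made-up name), and the function name is also made up.

```python
def remaining_urls(all_urls, extracted_rows, url_field="url"):
    """Return the URLs from the first task that have no row in the export yet.

    `all_urls` is the full list produced by the first (list) task.
    `extracted_rows` are the rows already exported by the second task,
    as dictionaries, e.g. loaded with csv.DictReader.
    Order is kept and duplicate URLs are returned only once.
    """
    done = {row[url_field] for row in extracted_rows}
    seen = set()
    todo = []
    for url in all_urls:
        if url not in done and url not in seen:
            seen.add(url)
            todo.append(url)
    return todo


all_urls = ["https://example.com/p/1", "https://example.com/p/2", "https://example.com/p/3"]
extracted = [{"url": "https://example.com/p/1", "name": "Product 1"}]
print(remaining_urls(all_urls, extracted))
```

The printed list is what goes into the “List of URLs” task for the next run.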
If you want more than 10 tasks, you can export tasks and keep the .otd files on your computer, delete them from Octoparse, and import them back when you need them.
If you want faster speed when extracting a list of URLs, you can split the list into multiple batches and run 2 tasks at the same time on one computer, plus more on virtual machines created with VMware. You do not need multiple Octoparse accounts.
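Splitting a long list of URLs into batches of nearly equal size is easy to do by hand, but here is a small Python sketch for long lists. The function name is made up for illustration; each returned batch is meant to be pasted into its own task.

```python
def split_into_batches(urls, n_batches):
    """Split `urls` into `n_batches` consecutive batches of nearly equal size.

    The first batches get one extra URL when the list does not divide evenly.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    size, extra = divmod(len(urls), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(urls[start:end])
        start = end
    return batches


urls = [f"https://example.com/car/{n}" for n in range(10)]
# 2 tasks on the computer plus 2 on a virtual machine = 4 batches
for batch in split_into_batches(urls, 4):
    print(len(batch), batch[0])
```

Keeping the batches consecutive makes it easier to see, after a crash, where each task stopped.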