I have been making vehicle databases for Europe since 2003, for America since 2014, and for India since 2015… what to do next in 2016? Maybe Australia?
I created a page titled Australia in May 2016 as a marketing experiment, to see how many people would click it and to decide whether it was worth my effort to create a car database for a country with only 25 million people.
The page got very low traffic for one year, but during the last days of May 2017 I got 2 people interested in purchasing an Australian car database, so I started studying possible sources of data. Another 2 people left comments on 3 and 4 June, asking me to scrape data from several possible websites (despite the legal issues of scraping, I HAD TO DO IT to serve people who had no other choice). Now I feel sad for not creating the Australia Car Database earlier: its sales were slightly higher than the sales of the India Car Database (and India is 50 times more populous than Australia).
2017 database made by me alone
Most of the databases I am selling are made using a “universal” scraper written in VB.NET by a Pakistani programmer, but due to anti-scraping measures on the Australian car website, his scraper does not work, so I had to look online for another scraper and found Octoparse.com, a FREE scraper that was slow and buggy and crashed a lot, forcing me to run it in small batches of about 1000-2000 cars and assemble them. Scraping took about 2 weeks, and I published the database on 16 June 2017. Of the initial 4 customers, only 1 ended up purchasing, but new customers came, and the number of sales in the first months was surprisingly high for a country with just 1/13 of the population of the United States.
In November 2017, the first customer asked for an update, and when I tried to do it, I noticed that the source website had removed the META tags for year, make, and model, so the only place where this essential information is displayed is the page title. There, year, make, model, and badge are in a single field, requiring me to manually separate them after each update. An update takes 1-2 days and involves scraping only the last year of cars. I was hugely lucky to have scraped all 90000+ cars in June 2017, when I could get make, model, and year separately and automatically.
At the April 2018 update, the source website removed several data fields such as VIN, added them back a few days later, removed them again after 1 month, etc.
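The manual separation of the page title could in principle be partly automated. Below is a minimal sketch, assuming titles look like "2017 Toyota Corolla Ascent Sport" (year first, then make, model, and badge); the real title format may differ. The `known_models` dictionary is a hypothetical lookup of make to models that could be built from the June 2017 scrape, where these fields were still separate. It is not the method I actually used.

```python
import re

def split_title(title, known_models):
    """Split a 'YEAR MAKE MODEL BADGE' title into separate fields.

    known_models is a hypothetical dict: make -> list of model names.
    Returns None when the title has no leading year or an unknown make.
    """
    match = re.match(r"\s*(\d{4})\s+(.*)", title)
    if not match:
        return None
    year = int(match.group(1))
    rest = match.group(2).strip()

    # Try longer names first so "Land Rover" wins over a shorter make.
    for make in sorted(known_models, key=len, reverse=True):
        if rest.lower() == make.lower() or rest.lower().startswith(make.lower() + " "):
            tail = rest[len(make):].strip()
            for model in sorted(known_models[make], key=len, reverse=True):
                if tail.lower() == model.lower() or tail.lower().startswith(model.lower() + " "):
                    badge = tail[len(model):].strip()
                    return {"year": year, "make": make, "model": model, "badge": badge}
            # Make is known but the model is new: leave it for manual review.
            return {"year": year, "make": make, "model": None, "badge": tail}
    return None
```

Rows that come back with `model` set to `None` (or as `None` entirely) would still need to be separated by hand, for example when a new make or model appears on the website.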
March 2019 database made with help from Australian student
In February 2019 I met a student from Australia who asked a few questions about my business. I answered them thinking that he was a customer looking to buy a database, but he turned out to be a student who claimed that he had scraped Redbook with Python, and he threatened that if I didn’t help him (help with what?) he would build his own website to sell car data via API. We finally agreed to work together rather than compete with each other.
He gave me a BETA .py scraper which was not working (I paid for it) and said that if I had more customers interested in web scraping I could pass them to him, and I did pass one (he probably got paid another $500).
I told him to fix it, and after a few days he came up with the idea to host the scraper on AWS and let the script run automatically on a schedule. He gave me a username/password with which I could export the data as CSV and sell it via my website. I did this, providing an update for all my customers on 24 March (118 columns) and promising all customers monthly updates that involve re-scraping ALL 1960-present cars, not just the latest year, so the Private price guide and Trade in price guide will be updated accordingly.
2 customers reported missing data for most cars in the Standard / Optional Equipment columns. I asked the student to fix the errors; on 18 April 2019 he replied “I’ve been really busy lately with Uni exams, interviews for jobs and the insurance project. Sorry Ill get to the Australian database as soon as I can” and that was the LAST day I heard from him. He did not sign in anymore. The AWS account was suspended, probably because AWS offers a free service and bills you at the end of the month, and he did not pay the bills, so all I had was a non-working BETA scraper. I think that he may also have died in a car crash: he told me that he had passed his driving exam recently, and given that his dad owned a business, he may have got a very powerful car.
While I can update the database again using my old method (96 columns only, without average trade-in value, without colors and features, etc.), adding new cars and not updating older records, I hope to find another Python expert to correct his scraper and make it run properly.
July 2019 database made with help from Indian programmer
I found someone in India experienced in Python. I paid him $50 and $100 for two small projects that he completed successfully, then in July 2019 I gave him the BETA scraper from the Australian student and we agreed on $250 to fix it… he said that it was the most difficult project he had ever done in Python. He also made many errors, and we were hardly ever online at the same time. Only in September 2019, after paying another $150, could I say that he had fixed most errors and that I could extract most of the data from the source website. I decided to provide you a TEMPORARY update with 2019 models alongside the March 2019 update (1960-2019), and I ask all my customers to report errors so I can tell the programmer to fix them before scraping all 1960-2019 cars with this new scraper, so that the whole database will be harmonized.
When I tried to scrape 1960s cars, I noticed that the scraper breaks when it comes across a car with NO image. I had to pay the Indian programmer again to fix it, then I found more errors. His experience is very low… I need to mention that he offered to sell me a couple of databases for $$; in October I figured out that “his portfolio” consisted mostly of databases available for free download on various sites, but he lied to me that he had created them himself, and this misled me into thinking that he was very experienced (many customers I told this story to said that he is really an idiot and a scammer to lie and charge money for databases he did not create). I spent sooo much time testing his scraper and reporting errors in the Australian database that customers of the European and American car databases complained about delayed updates for their databases. So I need to take a break from the Australia project to serve the Americans.
As of 5 November I provided another update with 2019 models, and I can re-scrape 2019 models anytime you require. I will re-scrape the 1960-2018 cars when I am less busy. I need to mention that the source website has an IP blocker which was not present in March 2019 when the Australian student did the same job. Due to this IP blocker I cannot leave the scraper running 24/7 and have the data ready in a few days; I need to actively monitor the scraper and run it in small batches.
I waited another 2 months and paid extra $ for him to fix his own errors. He fixed the last reported error on 19 November, and then my turn came… to run the scraper for all 1960-2019 cars. But I was getting IP blocked often, and the cars scraped before a block were not saved unless the batch of URLs was completed, so I had to run in small batches of max 5000 URLs and change IP 2 times per day. For this reason, it was NOT an ideal moment to run other scrapers; I delayed updates of car databases for other countries in order to help YOU, customer of the Australia car database!
Due to complaints from American customers about delayed updates of the American car database, I had to take a break from scraping Australia on 5-10 December. On 17 December I scraped the last batch of cars, joined all 23 files into one, and published the updated database!
On 19 December I emailed all customers to inform them that I had finalized the update of the Australia car database with the new scraper made by the Indian programmer. I mentioned in the email that 11 columns have no data due to typos in the code (these are columns with rare data, available for less than 5% of cars).
But there was one more MAJOR MISTAKE that (unfortunately) I did not notice, because the scraping process produced cars in somewhat random order. I joined the 23 CSV files into one Excel file and only at the LAST MINUTE sorted the table alphabetically and immediately emailed everyone, without checking the Excel file further (the alphabetic sort clearly showed the error). Christmas was coming and I was in a hurry to finalize updates for 9 databases.
Even though I had never coded in Python before, it took me only a few hours to fix by myself the code that he had not fixed in 4 months: I fixed the code that generated wrong car names and empty columns, except for Colors and Optional Equipment (the Indian programmer wants another $100 for this and promises to do it in 3 days. Are you fucking serious? Any skilled programmer wouldn’t spend more than 1 hour on this! I already waited for 4 months). For those two I asked a few friends, and they promised to help me after the holidays… I was planning a complete re-scraping of all 100,000 cars, which would take 20 days.
At that moment I did not realize that I could quickly fix the wrong car names… by copy-pasting from another sheet I had made previously.
I emailed everyone again on 10 January to download the CORRECTED database.
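Losing every car scraped before an IP block could be avoided by writing each car to disk as soon as it is scraped and skipping already saved URLs on the next run. The following is a minimal sketch of that idea, not the code of the scraper I was using: `fetch` is a hypothetical function that downloads one car page and returns a dict, and `Blocked` is a hypothetical exception it raises when the website blocks the IP.

```python
import csv
import os

class Blocked(Exception):
    """Raised by the (hypothetical) fetch function when the site blocks our IP."""

def already_done(out_path):
    """Return the set of URLs that are already saved in the output CSV."""
    if not os.path.exists(out_path):
        return set()
    with open(out_path, newline="", encoding="utf-8") as f:
        return {row["url"] for row in csv.DictReader(f)}

def scrape_with_checkpoint(urls, fetch, out_path, fields):
    """Scrape urls one by one, saving every row immediately.

    Stops at the first Blocked error; calling it again later (after
    changing IP) continues where it stopped. Returns the number of new rows.
    """
    done = already_done(out_path)
    new_file = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    saved = 0
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url"] + list(fields))
        if new_file:
            writer.writeheader()
        for url in urls:
            if url in done:
                continue  # saved during an earlier run
            try:
                record = fetch(url)
            except Blocked:
                break  # keep what we have; resume after changing IP
            writer.writerow({"url": url, **record})
            f.flush()  # make sure the row is on disk before the next request
            saved += 1
    return saved
```

Run this way, the batch size matters much less: the same list of URLs can be passed again after each IP change, and only the missing cars are fetched.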
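A mistake like this could be caught by an automatic check right after joining the files, before emailing anyone. Below is a minimal sketch under assumptions: the column names `year`, `make`, and `model` are hypothetical (the real database has many more columns), and `known_makes` is a set of make names taken from a previous, trusted version of the database.

```python
import csv

def merge_csv_files(paths):
    """Read several CSV files with the same header into one list of rows."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    return rows

def find_suspect_rows(rows, known_makes, required=("year", "make", "model")):
    """Return (row index, list of problems) for rows that look wrong.

    A row is suspect when a required column is empty or when its make
    is not in known_makes (for example a model name in the make column).
    """
    suspects = []
    for i, row in enumerate(rows):
        problems = [c for c in required if not (row.get(c) or "").strip()]
        make = (row.get("make") or "").strip()
        if make and make not in known_makes:
            problems.append("unknown make")
        if problems:
            suspects.append((i, problems))
    return suspects
```

If `find_suspect_rows` returns more than a handful of rows, that is a sign to open the file and look before publishing. A genuinely new make would also be flagged, which is fine, since it only takes a moment to confirm.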
List of updates
122 makes, 92885 cars, as of 16 June 2017. Initial launch.
123 makes, 93901 cars, as of 20 November 2017. After about 10 sales, I saw the first person asking for an update.
123 makes, 96604 cars, as of 20 April 2018. I planned to update every 3 months, but the source website wasn’t working in February and March, so I was only able to do the update in April.
123 makes, 97610 cars, as of 30 August 2018.
123 makes, 98225 cars, as of 10 December 2018.
127? makes, 99925 cars, as of 24 February 2019, a test with the new scraper, not published.
124 makes, 100492 cars, as of 24 March 2019, published with 118 columns instead of 96.
3002 cars (2019 models only), temporary update as of 18 September 2019.
3339 cars (2019 models only), temporary update as of 5 November 2019.
125 makes, 1549 models, 13069 model years, 103084 model versions, as of 17 December 2019, all data re-scraped.