I built an Instagram scraper in three days.
Over the next six weeks, I spent 48 hours keeping it alive.
That ratio tells you almost everything you need to know about scraping modern, heavily protected websites.
The first version worked well enough to create a false sense of success. It could fetch profiles, extract posts, collect comments, and store the results. In development, it looked finished. In practice, it was only the beginning. Once the scraper started running against a live, changing platform, the real engineering work began: stale selectors, rate limits, degraded sessions, blocked IPs, JavaScript-rendered content, and constant low-level breakage that turned a small utility into an ongoing maintenance burden.
This article is not a tutorial on how to scrape Instagram. It is a postmortem on what broke after the first version worked, how long each class of failure took to fix, and why a scraper that works in development is very different from one that survives in production.
The first implementation included the usual pieces:
- profile fetching
- post extraction
- comment collection
- storage for the results
That was enough to get the project off the ground. In the first few days, I was able to scrape a few hundred profiles without serious issues. The output looked clean, the code was structured, and the project felt stable enough to move on.
My commit message at that point was:
feat: initial scraper implementation - profiles, posts, and comments extraction

In retrospect, that was not a production scraper. It was a working prototype with a narrow operating window.
The first real failure was quiet.
The scraper still ran, but the output started degrading. Profiles came back as null. Post arrays were empty. Comment extraction silently failed. Nothing crashed, which made the problem more dangerous: if you only looked at process health, the scraper appeared fine.
The root cause was stale extraction logic. Instagram had changed enough of the DOM structure that the selectors I depended on no longer matched the right nodes. Class names had shifted, attributes had disappeared, and some data paths were no longer where I expected them to be.
This took 8 hours to fix in total.
The commit looked simple:
fix: update CSS selectors after DOM change

This was the first reminder that in scraping, successful execution does not mean successful extraction.
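One mitigation is to validate output quality, not just process health, so silent extraction failures surface quickly. A minimal sketch; the field names and the 20% failure threshold are illustrative, not the scraper's actual schema:

```python
def validate_profile(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks healthy.

    Null or empty fields are treated as extraction failures, because a
    stale selector usually fails silently rather than raising an error.
    """
    problems = []
    # Illustrative required fields; a real scraper would derive these
    # from its output schema.
    for field in ("username", "posts", "comments"):
        value = record.get(field)
        if value is None:
            problems.append(f"{field} is null")
        elif hasattr(value, "__len__") and len(value) == 0:
            problems.append(f"{field} is empty")
    return problems


def batch_health(records: list[dict], threshold: float = 0.2) -> bool:
    """Flag the whole batch as unhealthy if too many records fail validation."""
    if not records:
        return False
    bad = sum(1 for r in records if validate_profile(r))
    return (bad / len(records)) <= threshold
```

Checking health at the batch level matters: a single null profile can be a deleted account, but a spike in nulls across a batch almost always means the extraction logic has gone stale.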
The next break was much more obvious.
Requests started failing with 429 Too Many Requests. Authentication became inconsistent. Some sessions stopped persisting correctly, and the account/IP combination I was using was clearly being treated differently than before.
This forced changes in several places at once: request throttling, retry behavior, session handling, and IP management.
I ended up spending 10 hours on this round of fixes.
The resulting commits were:
fix: add exponential backoff + request throttling
chore: implement basic IP rotation

This was the point where the project stopped feeling like a parser and started feeling like infrastructure. Once request volume matters, scraping becomes less about data extraction and more about surviving the environment around it.
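The backoff side of that fix can be sketched roughly like this; the `RateLimited` exception and the parameter values are illustrative, not the actual implementation:

```python
import random
import time


class RateLimited(Exception):
    """Raised by the transport layer on an HTTP 429 response."""


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Yield exponentially growing delays with full jitter.

    Drawing each delay uniformly from [0, base * 2**attempt] avoids
    synchronized retry bursts when several workers hit limits at once.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))


def fetch_with_backoff(fetch, max_retries: int = 5, base: float = 1.0):
    """Call `fetch` until it stops raising RateLimited, sleeping between tries."""
    for delay in backoff_delays(max_retries, base=base):
        try:
            return fetch()
        except RateLimited:
            time.sleep(delay)
    return fetch()  # final attempt; let any error propagate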
By the next phase, a different class of failure appeared.
Requests that had previously returned useful page content started returning mostly empty HTML shells: scripts, placeholders, and very little real data. The page rendered correctly in a browser, but the raw response no longer contained enough information to parse.
That usually means one thing: more of the content path has moved behind client-side rendering.
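A cheap way to detect that condition before wasting a parse is to compare visible text against script payload in the raw response. A heuristic sketch, where the 200-character threshold is a guess rather than a tuned value:

```python
from html.parser import HTMLParser


class _TextRatio(HTMLParser):
    """Accumulate visible text length vs. script/style payload length."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.text_len = 0
        self.script_len = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._in_script = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_len += len(data)
        else:
            self.text_len += len(data.strip())


def looks_like_js_shell(html: str, min_text: int = 200) -> bool:
    """Heuristic: a page whose visible text is tiny relative to its
    script payload is probably assembled client-side."""
    p = _TextRatio()
    p.feed(html)
    return p.text_len < min_text and p.script_len > p.text_len
```

Flagging shells explicitly turns "the parser mysteriously returned nothing" into a distinct, countable failure mode, which is what tells you it is time for a rendering layer.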
To keep extracting data, I had to add a JavaScript execution layer and move part of the scraper to a headless browser workflow. That changed the project significantly. It increased runtime cost, added another maintenance surface, and introduced a whole new set of ways to get blocked.
That migration took 12 hours.
The commit:
feat: migrate to headless browser for JS rendering

This was not a patch. It was an architectural change.
After the major breakages were addressed, the scraper did not become stable. It became fragile in smaller, more exhausting ways.
At that point, the issues were less dramatic but more persistent: flaky selectors, expiring sessions, misbehaving retries, degrading proxies, and intermittent rendering failures.
None of these problems looked catastrophic on their own. Together, they created a system that demanded constant attention.
Over the next two weeks, I spent another 18 hours on recurring maintenance.
At this stage, adding new features was no longer the hard part. Keeping the scraper alive was.
That is where many scraping projects become a poor engineering tradeoff. They do not fail all at once. They slowly consume more time than they justify.
By the end of the six-week maintenance period, the 48 hours broke down like this:
- 8 hours updating selectors after DOM changes
- 10 hours on rate limiting, backoff, and IP rotation
- 12 hours migrating to a headless browser for JS rendering
- 18 hours of recurring maintenance
That was double the original build time.
Looking back, the failures fell into a few predictable categories.
Front-end changes do not need to be dramatic to break a scraper. A renamed class, a removed attribute, or a slightly different nesting structure can be enough to make extraction silently fail.
This is one of the most common breakages because extraction logic is usually tightly coupled to assumptions about page structure.
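One way to loosen that coupling is to try several candidate data paths for the same field, so a single structural change degrades gracefully instead of failing outright. The paths below are illustrative, not Instagram's actual payload shape:

```python
def dig(obj, path):
    """Walk a dict/list structure by a tuple of keys/indices; None if any step fails."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return None
    return obj


# Candidate locations for the same datum. When the primary path goes
# stale, fallbacks keep extraction alive long enough to notice and fix it.
# These paths are made up for illustration.
FOLLOWER_PATHS = [
    ("graphql", "user", "edge_followed_by", "count"),
    ("user", "follower_count"),
]


def extract_followers(payload: dict):
    for path in FOLLOWER_PATHS:
        value = dig(payload, path)
        if value is not None:
            return value
    return None
```

Logging which path matched is worth the extra line: a sudden shift from the primary path to a fallback is an early warning that the page structure has changed again.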
The safe request envelope is often narrower than it seems during development.
A scraper may work for a few hundred requests in testing, then degrade quickly once request volume, concurrency, or timing patterns become more predictable. Once you start hitting thresholds, response quality drops fast.
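Proactive pacing helps you stay inside that envelope instead of discovering its edges through 429s. A token-bucket sketch with jitter, where the rates and the 10% jitter window are placeholders:

```python
import random
import time


class Throttle:
    """Token-bucket pacing with jitter, so request timing stays inside a
    budget without becoming metronomically regular."""

    def __init__(self, rate_per_min: float, burst: int = 3, clock=time.monotonic):
        self.interval = 60.0 / rate_per_min  # seconds per token
        self.burst = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def wait_time(self) -> float:
        """Seconds to wait before the next request; near zero if a token is free."""
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) / self.interval)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            # Small jitter so intervals are not perfectly regular.
            return random.uniform(0, self.interval * 0.1)
        return (1 - self.tokens) * self.interval
```

Injecting the clock makes the pacing logic testable without real sleeps, which matters once throttling becomes code you have to maintain rather than a one-off delay.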
Authenticated scraping is rarely set-and-forget.
Cookies expire, session state degrades, login flows change, and retries can trigger secondary issues if they are too aggressive. Once authentication becomes unreliable, the rest of the pipeline becomes unreliable with it.
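One pattern that reduces the churn is refreshing sessions proactively instead of discovering a dead one through failed extractions. A sketch with illustrative thresholds and a caller-supplied login function:

```python
import time


class Session:
    """Track session health and refresh before it degrades mid-run."""

    def __init__(self, login, max_age_s: float = 3600, max_failures: int = 3,
                 clock=time.monotonic):
        self._login = login          # caller-supplied re-authentication hook
        self.max_age_s = max_age_s
        self.max_failures = max_failures
        self.clock = clock
        self.failures = 0
        self.created = clock()

    def record(self, ok: bool):
        """Reset the failure streak on success; count consecutive failures."""
        self.failures = 0 if ok else self.failures + 1

    def needs_refresh(self) -> bool:
        too_old = self.clock() - self.created > self.max_age_s
        return too_old or self.failures >= self.max_failures

    def ensure_fresh(self):
        if self.needs_refresh():
            self._login()
            self.failures = 0
            self.created = self.clock()
```

Keeping the re-login behind `ensure_fresh` also caps retry aggression: the session refreshes once per degradation, rather than hammering the login flow and triggering the secondary issues mentioned above.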
Not all IPs are equal.
Some fail immediately. Others work for a while and then degrade. Even with rotation, low-quality IPs or heavily reused ranges tend to become liabilities over time. Good extraction logic cannot compensate for bad network reputation.
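Rotation alone is not enough; the pool needs memory of how each IP has been behaving. A sketch of success-rate scoring that demotes degrading proxies, with illustrative parameters:

```python
class ProxyPool:
    """Rotate proxies by recent success rate and retire ones that degrade.

    Scores are an exponential moving average, so a proxy that worked for
    a while and then degraded gets demoted instead of trusted forever.
    """

    def __init__(self, proxies, min_score: float = 0.5, alpha: float = 0.3):
        self.scores = {p: 1.0 for p in proxies}  # start optimistic
        self.min_score = min_score               # below this, a proxy is benched
        self.alpha = alpha                       # weight of the newest observation

    def pick(self):
        live = {p: s for p, s in self.scores.items() if s >= self.min_score}
        if not live:
            raise RuntimeError("no healthy proxies left")
        return max(live, key=live.get)

    def report(self, proxy, ok: bool):
        # Blend the latest outcome (1.0 success / 0.0 failure) into the score.
        self.scores[proxy] = (1 - self.alpha) * self.scores[proxy] \
            + self.alpha * (1.0 if ok else 0.0)
```

The "no healthy proxies left" case is worth surfacing loudly: it usually means the problem is network reputation, and no amount of extraction-side fixing will help.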
A scraper may send correct requests and still look automated.
Timing patterns, navigation sequences, header order, TLS characteristics, and rendering behavior all contribute to whether traffic looks like a real user or a scripted system. Once browser automation enters the stack, this gets even harder to manage.
More websites are shipping less meaningful HTML in the initial response and relying on client-side rendering to assemble page content. That means an HTTP-based scraper can fail even when access is technically possible.
Once JavaScript execution becomes necessary, the scope of the project changes.
A scraper can work reliably in a test window and still be nowhere near production-ready.
That is because “working” in development usually means a few hundred requests succeeded, the output looked clean, and nothing obviously broke.
Production changes the problem. Now the scraper has to stay correct over time, under changing conditions, while continuing to deliver reliable data. That is not just parsing. That is infrastructure.
In this case, the numbers were simple: roughly 24 hours to build, then 48 hours of maintenance over the following six weeks.
Using a rough engineering cost of $100/hour, that comes out to about $2,400 to build and $4,800 to keep alive.
And even that understates the true cost, because it does not capture interruption, context switching, roadmap delays, or the drag of reactive engineering work.
The biggest cost was not difficulty. It was attention.
Every maintenance cycle pulled time away from actual product work. Instead of improving features, strengthening downstream systems, or shipping new capabilities, I was debugging selectors, sessions, retries, proxies, and rendering issues.
That is the part teams often underestimate. Scraping infrastructure does not just consume engineering hours. It competes directly with the roadmap.
If scraping is your core product, that may be a justified investment.
If it is a supporting capability, the economics look very different.
After this project, I think the decision is less about ideology and more about focus.
A DIY scraper gives you full control, but it also makes you responsible for everything: selector maintenance, rate limiting and backoff, session handling, proxy rotation, JavaScript rendering, and whatever anti-bot measure ships next.
An Instagram managed scraping API moves much of that operational burden to a provider whose job is to maintain the collection layer.
Here is the practical comparison:

| | DIY scraper | Managed scraping API |
| --- | --- | --- |
| Control | Full control over the stack | Limited to the provider's API surface |
| Operational burden | Selectors, sessions, proxies, rendering are all on you | Largely shifted to the provider |
| Expertise required | Adversarial browser automation becomes your specialty | Stays with the provider |
For teams that need the data but do not want to become specialists in adversarial browser automation, that tradeoff becomes rational very quickly.
There are still valid reasons to build in-house.
If the system is strategic and you have the engineering capacity to maintain it properly, building it yourself can make sense.
But if the target is a large, fast-changing, heavily protected platform, the long-term maintenance profile deserves at least as much attention as the initial build.
The first version of my Instagram scraper worked.
That turned out to be the least important fact about it.
What mattered was how quickly it started failing under real conditions, and how much engineering time it took to restore reliability every time it broke.
That is the distinction I would emphasize to anyone evaluating a scraping project:
A scraper that works in development is a parser.
A scraper that works in production is infrastructure.
And infrastructure is where the real cost begins.