René's URL Explorer Experiment


Title: GitHub - balascript/webcrawler: The main objective of the project is to crawl web pages, store them, remove noise in them and finally verify the Zipf’s law distribution with the obtained noise-free content.

Description: The main objective of the project is to crawl web pages, store them, remove noise in them and finally verify the Zipf’s law distribution with the obtained noise-free content. - balascript/webcrawler

Opengraph URL: https://github.com/balascript/webcrawler

X: @github

Domain: patch-diff.githubusercontent.com

Links:

balascript https://patch-diff.githubusercontent.com/balascript
webcrawler https://patch-diff.githubusercontent.com/balascript/webcrawler
MIT license https://patch-diff.githubusercontent.com/balascript/webcrawler/blob/master/LICENSE
0 stars https://patch-diff.githubusercontent.com/balascript/webcrawler/stargazers
0 forks https://patch-diff.githubusercontent.com/balascript/webcrawler/forks
Branches https://patch-diff.githubusercontent.com/balascript/webcrawler/branches
Tags https://patch-diff.githubusercontent.com/balascript/webcrawler/tags
Activity https://patch-diff.githubusercontent.com/balascript/webcrawler/activity
Code https://patch-diff.githubusercontent.com/balascript/webcrawler
Issues 0 https://patch-diff.githubusercontent.com/balascript/webcrawler/issues
Pull requests 0 https://patch-diff.githubusercontent.com/balascript/webcrawler/pulls
Actions https://patch-diff.githubusercontent.com/balascript/webcrawler/actions
Projects 0 https://patch-diff.githubusercontent.com/balascript/webcrawler/projects
Security 0 https://patch-diff.githubusercontent.com/balascript/webcrawler/security
Insights https://patch-diff.githubusercontent.com/balascript/webcrawler/pulse
6 Commits https://patch-diff.githubusercontent.com/balascript/webcrawler/commits/master/
resources https://patch-diff.githubusercontent.com/balascript/webcrawler/tree/master/resources
src/edu/scu/java/webcrawl https://patch-diff.githubusercontent.com/balascript/webcrawler/tree/master/src/edu/scu/java/webcrawl
.gitignore https://patch-diff.githubusercontent.com/balascript/webcrawler/blob/master/.gitignore
LICENSE https://patch-diff.githubusercontent.com/balascript/webcrawler/blob/master/LICENSE
README.md https://patch-diff.githubusercontent.com/balascript/webcrawler/blob/master/README.md
config.csv https://patch-diff.githubusercontent.com/balascript/webcrawler/blob/master/config.csv
pom.xml https://patch-diff.githubusercontent.com/balascript/webcrawler/blob/master/pom.xml
https://patch-diff.githubusercontent.com/balascript/webcrawler#introduction
https://patch-diff.githubusercontent.com/balascript/webcrawler#architecture
https://patch-diff.githubusercontent.com/balascript/webcrawler
https://patch-diff.githubusercontent.com/balascript/webcrawler#noise-removal
https://patch-diff.githubusercontent.com/balascript/webcrawler#challenges-faced
https://patch-diff.githubusercontent.com/balascript/webcrawler#report-html
https://patch-diff.githubusercontent.com/balascript/webcrawler#zipfs-law-verification
https://patch-diff.githubusercontent.com/balascript/webcrawler#appendix---b
www.scu.edu http://www.scu.edu
https://patch-diff.githubusercontent.com/balascript/webcrawler#references
http://jsoup.org/
https://github.com/TrigonicSolutions/jrobotx
https://github.com/kohlschutter/boilerpipe
https://github.com/spullara/mustache.java
http://opencsv.sourceforge.net/
https://adblockplus.org/en/filters#basic
https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java
https://github.com/google/guava
https://lucene.apache.org/core/4_0_0/
https://commons.apache.org/proper/commons-validator/
http://jsoup.org/cookbook/input/load-document-from-url
http://stackoverflow.com/questions/1600291/validating-url-in-java
http://howtodoinjava.com/core-java/multi-threading/how-to-work-with-wait-notify-and-notifyall-in-java/
http://stackoverflow.com/questions/3771081/proper-way-to-check-for-url-equality
http://stackoverflow.com/questions/35128092/threadgroup-in-java
http://www.tutorialspoint.com/java/java_thread_communication.htm
http://www.roseindia.net/java/beginners/java-create-directory.shtml
http://stackoverflow.com/questions/27399419/giving-current-timestamp-as-folder-name-in-java
http://stackoverflow.com/questions/1922677/nullpointerexception-when-creating-an-array-of-objects
http://videolectures.net/wsdm2010_kohlschutter_bdu/
http://www.philippeadjiman.com/blog/2009/10/26/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick/
http://www.businessinsider.com/henry-blodget-bing-revisited-still-toast-but-slightly-less-burnt-2010-3
http://stackoverflow.com/questions/5640334/how-do-i-preserve-line-breaks-when-using-jsoup-to-convert-html-to-plain-text
http://hellospark.com/en-us/blog/invest-google-adwords-bing/
Readme https://patch-diff.githubusercontent.com/balascript/webcrawler#readme-ov-file
MIT license https://patch-diff.githubusercontent.com/balascript/webcrawler#MIT-1-ov-file
1 watching https://patch-diff.githubusercontent.com/balascript/webcrawler/watchers
Releases https://patch-diff.githubusercontent.com/balascript/webcrawler/releases
Packages 0 https://patch-diff.githubusercontent.com/users/balascript/packages?repo_name=webcrawler
Java 84.9% https://patch-diff.githubusercontent.com/balascript/webcrawler/search?l=java
HTML 15.1% https://patch-diff.githubusercontent.com/balascript/webcrawler/search?l=html


URLs of crawlers that visited me.