Title: Feature: Add file/directory exclusion feature with glob pattern support by fishmingyu · Pull Request #199 · sourcegraph/scip-python · GitHub
Description: Motivation
In many repositories, we don't want to index certain files (e.g., tests*). A repository may also contain files that are broken or meaningless to index. Such files not only increase the processing time of pyright-scip but can also cause the run to abort. The first log below shows a failed run on the sympy repo; the second shows a successful run after applying the new exclude patterns.
Summary
This PR adds the ability to exclude files and directories from SCIP indexing using command-line flags or a configuration file. The exclusion feature supports both exact paths and glob patterns (e.g., test_*), and works as a filter that gracefully handles non-matching patterns without errors.
Changes Made
1. MainCommand.ts
Added exclude?: string[] to IndexOptions interface
Added excludeConfig?: string to IndexOptions interface
Added --exclude flag to accept multiple file/directory paths
Added --exclude-config flag to accept a config file with exclusion paths
2. indexer.ts
Added import { minimatch } from 'minimatch' for glob pattern matching
Implemented exclusion logic after targetOnly filtering (lines 122-179)
Reads patterns from --exclude flag
Reads patterns from config file if --exclude-config is provided
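Config file supports: one pattern per line, with lines starting with # treated as comments (as in the example config file under Usage)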
Pattern matching features:
Exact path matching (original functionality)
Glob patterns: dir*, file*, tests/**, etc.
Relative and absolute paths
Works as a filter - patterns matching nothing don't cause errors
3. package.json
Added minimatch dependency for glob pattern matching
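The filter behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in indexer.ts: the names globToRegExp, isExcluded, and applyExcludes are hypothetical, the small glob-to-RegExp converter is a simplified stand-in for minimatch so the sketch has no dependencies, and the matching rules (full relative path, file basename, or directory prefix) are assumptions about how a pattern might be compared against a file.

```typescript
// Simplified stand-in for minimatch (assumption): handles only `*`, `**`, and `?`.
function globToRegExp(glob: string): RegExp {
    let source = '';
    for (let i = 0; i < glob.length; i++) {
        const ch = glob.charAt(i);
        if (ch === '*') {
            if (glob.charAt(i + 1) === '*') {
                source += '.*'; // `**` may cross directory separators
                i++;
            } else {
                source += '[^/]*'; // `*` stays within one path segment
            }
        } else if (ch === '?') {
            source += '[^/]'; // `?` matches one non-separator character
        } else {
            // Escape characters that are special in regular expressions.
            source += ch.replace(/[.+^${}()|[\]\\]/g, '\\$&');
        }
    }
    return new RegExp('^' + source + '$');
}

// Hypothetical matching rules: exact path, directory prefix, or a glob
// tested against the relative path and against the file's basename.
function isExcluded(relPath: string, patterns: string[]): boolean {
    const segments = relPath.split('/');
    const baseName = segments[segments.length - 1];
    return patterns.some((pattern) => {
        const trimmed = pattern.replace(/\/+$/, ''); // `src/experimental/` -> `src/experimental`
        if (relPath === trimmed || relPath.indexOf(trimmed + '/') === 0) {
            return true;
        }
        const re = globToRegExp(trimmed);
        return re.test(relPath) || re.test(baseName);
    });
}

// Acts purely as a filter: a pattern that matches nothing leaves the list unchanged.
function applyExcludes(files: string[], patterns: string[]): string[] {
    return files.filter((file) => !isExcluded(file, patterns));
}
```

Because applyExcludes only filters, a pattern that matches no file returns the input list unchanged instead of raising an error, which is the "works as a filter" property listed above.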
Usage
Exclude specific files/directories via command line:
scip-python index --project-name=myproject --exclude path/to/broken.py --exclude path/to/circular/
Exclude using patterns:
scip-python index --project-name=myproject --exclude "test_*" "build/**"
Exclude using a config file:
scip-python index --project-name=myproject --exclude-config=.scipignore
Example config file format (.scipignore):
# Broken files
src/broken_module.py
# Directories with circular dependencies
src/experimental/
tests/broken/
# Glob patterns
test_*
build/**
# Another problematic file
lib/legacy.py
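A config file in this format could be turned into a pattern list along the following lines. This is a sketch under assumptions: parseExcludeConfig and collectPatterns are hypothetical names, blank lines are assumed to be ignored, and only whole-line # comments are handled. The functions take the file contents as a string so the sketch stays self-contained; in practice the contents would come from reading the path given to --exclude-config.

```typescript
// Parse .scipignore-style contents: one pattern per line.
// Assumptions: blank lines are skipped and a line starting with `#` is a comment.
function parseExcludeConfig(contents: string): string[] {
    return contents
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0 && line.charAt(0) !== '#');
}

// Merge patterns from --exclude with those from the config file,
// dropping duplicates while keeping first-seen order.
function collectPatterns(cliPatterns: string[], configContents?: string): string[] {
    const fromConfig = configContents === undefined ? [] : parseExcludeConfig(configContents);
    const merged = cliPatterns.concat(fromConfig);
    return merged.filter((pattern, index) => merged.indexOf(pattern) === index);
}
```

Keeping the parser separate from file I/O makes it easy to check against a string that mixes comments, exact paths, and globs, as the example file above does.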
Benefits
Flexibility: Supports both exact paths and glob patterns
Robustness: Works as a filter - no errors if patterns match nothing
Usability: Config file support for managing complex exclusion rules
Consistency: Follows the same pattern as existing --target-only flag
Testing
The feature can be tested by:
Using --exclude with exact paths
Using --exclude with glob patterns like test_*
Using --exclude-config with a file containing mixed patterns and comments
Verifying that non-matching patterns don't cause errors
Log when indexing sympy directly, without exclusions (fails with an out-of-memory error)
(11:57:21) pyproject.toml file found at /home/zhongming/.codeminer/sympy_sympy.
(11:57:21) Loading pyproject.toml file at /home/zhongming/.codeminer/sympy_sympy/pyproject.toml
Assuming Python version 3.11
Assuming Python platform Linux
Auto-excluding **/node_modules
Auto-excluding **/__pycache__
Auto-excluding **/.*
(11:57:21) Total Project Files 1522
(11:57:21) Indexing /home/zhongming/.codeminer/sympy_sympy with version d293133e81194adc11177729af91c970f092a6e7
(11:57:21) Evaluating python environment dependencies
(11:57:21) Gathering environment information
(11:57:22) Parse and search for dependencies
(11:57:32) 152 / 1522
(11:57:43) 211 / 1522
(11:57:57) 377 / 1522
(11:58:07) 577 / 1522
(11:58:17) 864 / 1522
(11:58:27) 958 / 1522
(11:58:51) 1084 / 1522
(11:59:01) 1276 / 1522
(11:59:11) 1419 / 1522
(11:59:14) Index workspace and track project files
(11:59:14) Analyze project and dependencies
(11:59:26) 76 / 1524
(11:59:37) 114 / 1524
(11:59:47) 165 / 1524
(11:59:57) 224 / 1524
(12:00:08) 264 / 1524
(12:00:33) 301 / 1524
(12:00:43) 432 / 1524
(12:00:53) 477 / 1524
(12:01:03) 526 / 1524
(12:01:13) 584 / 1524
(12:01:25) 614 / 1524
(12:01:37) 642 / 1524
<--- Last few GCs --->
[2024902:0x7c30a30] 258240 ms: Mark-Compact 3985.9 (4128.9) -> 3970.3 (4129.4) MB, 1982.51 / 0.00 ms (average mu = 0.180, current mu = 0.020) allocation failure; scavenge might not succeed
[2024902:0x7c30a30] 260659 ms: Mark-Compact 3986.5 (4129.4) -> 3970.9 (4129.9) MB, 2370.20 / 0.00 ms (average mu = 0.103, current mu = 0.020) allocation failure; scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----
1: 0xb8d0a3 node::OOMErrorHandler(char const*, v8::OOMDetails const&) [node]
2: 0xf06250 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
3: 0xf06537 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
4: 0x11180d5 [node]
5: 0x1118664 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [node]
6: 0x112f554 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::internal::GarbageCollectionReason, char const*) [node]
7: 0x112fd6c v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
8: 0x1106071 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
9: 0x1107205 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
10: 0x10e4856 v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
11: 0x1540686 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
12: 0x7ecdc3cd9ef6
Log after applying the exclude feature
INFO Running in conda environment: ['scip-python', 'index', '--cwd', '/home/zhongming/.codeminer/sympy_sympy',
'--project-name', 'test_swebench', '--output', '/home/zhongming/.codeminer/sympy__sympy-27223/index.scip',
'--exclude', 'sympy/polys/numberfields/resolvent_lookup.py', '--exclude', 'test_*']
(13:28:16) No configuration file found.
(13:28:16) pyproject.toml file found at /home/zhongming/.codeminer/sympy_sympy.
(13:28:16) Loading pyproject.toml file at /home/zhongming/.codeminer/sympy_sympy/pyproject.toml
Assuming Python version 3.11
Assuming Python platform Linux
Auto-excluding **/node_modules
Auto-excluding **/__pycache__
Auto-excluding **/.*
(13:28:16) Total Project Files 915
(13:28:16) Indexing /home/zhongming/.codeminer/sympy_sympy with version d293133e81194adc11177729af91c970f092a6e7
(13:28:16) Evaluating python environment dependencies
(13:28:17) Gathering environment information
(13:28:17) Parse and search for dependencies
(13:28:28) 101 / 915
(13:28:43) 226 / 915
(13:28:53) 515 / 915
(13:29:04) 591 / 915
(13:29:14) 786 / 915
(13:29:17) Index workspace and track project files
(13:29:17) Analyze project and dependencies
(13:29:27) 28 / 917
(13:29:37) 145 / 917
(13:29:49) 163 / 917
(13:29:59) 480 / 917
(13:30:11) 508 / 917
(13:30:21) 684 / 917
(13:30:28) Parse and emit SCIP
(13:30:29) - (14/916): /home/zhongming/.codeminer/sympy_sympy/sympy/assumptions/facts.py
(13:30:30) - (48/916): /home/zhongming/.codeminer/sympy_sympy/sympy/calculus/tests/__init__.py
(13:30:31) - (74/916): /home/zhongming/.codeminer/sympy_sympy/sympy/combinatorics/free_groups.py
(13:30:33) - (85/916): /home/zhongming/.codeminer/sympy_sympy/sympy/combinatorics/permutations.py
(13:30:34) - (120/916): /home/zhongming/.codeminer/sympy_sympy/sympy/core/expr.py
(13:30:35) - (129/916): /home/zhongming/.codeminer/sympy_sympy/sympy/core/multidimensional.py
(13:30:36) - (177/916): /home/zhongming/.codeminer/sympy_sympy/sympy/functions/elementary/exponential.py
(13:30:38) - (218/916): /home/zhongming/.codeminer/sympy_sympy/sympy/holonomic/holonomicerrors.py
(13:30:39) - (264/916): /home/zhongming/.codeminer/sympy_sympy/sympy/logic/inference.py
(13:30:40) - (339/916): /home/zhongming/.codeminer/sympy_sympy/sympy/ntheory/generate.py
(13:30:41) - (355/916): /home/zhongming/.codeminer/sympy_sympy/sympy/parsing/autolev/_listener_autolev_antlr.py
(13:30:42) - (381/916): /home/zhongming/.codeminer/sympy_sympy/sympy/parsing/latex/__init__.py
(13:30:43) - (474/916): /home/zhongming/.codeminer/sympy_sympy/sympy/physics/quantum/qft.py
(13:30:44) - (573/916): /home/zhongming/.codeminer/sympy_sympy/sympy/polys/polyconfig.py
(13:30:46) - (581/916): /home/zhongming/.codeminer/sympy_sympy/sympy/polys/polyutils.py
(13:30:47) - (654/916): /home/zhongming/.codeminer/sympy_sympy/sympy/polys/numberfields/galoisgroups.py
(13:30:48) - (699/916): /home/zhongming/.codeminer/sympy_sympy/sympy/printing/pretty/pretty_symbology.py
(13:30:49) - (771/916): /home/zhongming/.codeminer/sympy_sympy/sympy/solvers/solveset.py
(13:30:50) - (808/916): /home/zhongming/.codeminer/sympy_sympy/sympy/stats/sampling/__init__.py
(13:30:51) - (832/916): /home/zhongming/.codeminer/sympy_sympy/sympy/tensor/toperators.py
(13:30:52) - (902/916): /home/zhongming/.codeminer/sympy_sympy/sympy/vector/deloperator.py
(13:30:53) Writing external symbols to SCIP index
(13:30:53) Sucessfully wrote SCIP index to /home/zhongming/.codeminer/sympy__sympy-27223/index.scip
Opengraph URL: https://github.com/sourcegraph/scip-python/pull/199